HIGH-DIMENSIONAL ADDITIVE MODELING

By Lukas Meier, Sara van de Geer and Peter Bühlmann

ETH Zurich 

We propose a new sparsity-smoothness penalty for high-dimen- 
sional generalized additive models. The combination of sparsity and 
smoothness is crucial for mathematical theory as well as performance 
for finite-sample data. We present a computationally efficient algo- 
rithm, with provable numerical convergence properties, for optimizing 
the penalized likelihood. Furthermore, we provide oracle results which 
yield asymptotic optimality of our estimator for high dimensional but 
sparse additive models. Finally, an adaptive version of our sparsity- 
smoothness penalized approach yields large additional performance 
gains. 

1. Introduction. Substantial progress has been achieved over the last 
years in estimating high-dimensional linear or generalized linear models 
where the number of covariates p is much larger than sample size n. The 
theoretical properties of penalization approaches like the lasso [28] are now 
well understood [3, 14, 23, 24, 33] and this knowledge has led to several 
extensions or alternative approaches like adaptive lasso [34], relaxed lasso 
[22], sure independence screening [12] and graphical model based methods 
[6]. Moreover, with the fast growing amount of high-dimensional data in, for 
example, biology, imaging or astronomy, these methods have shown their 
success in a variety of practical problems. However, in many situations, the 
conditional expectation of the response given the covariates may not be 
linear. While the most important effects may still be detected by a linear 
model, substantial improvements are sometimes possible by using a more 
flexible class of models. Recently, some progress has been made regarding 
high-dimensional additive model selection [7, 19, 26] and some theoretical 
results are available [26]. Other approaches are based on wavelets [27] or can 
adapt to the unknown smoothness of the underlying functions [2]. 



Received December 2008; revised February 2009. 

AMS 2000 subject classifications. Primary 62G08, 62F12; secondary 62J07.
Key words and phrases. Group lasso, model selection, nonparametric regression, oracle 
inequality, penalized likelihood, sparsity. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2009, Vol. 37, No. 6B, 3779-3821. This reprint differs from the original in 
pagination and typographic detail. 






In this paper, we consider the problem of estimating a high-dimensional
generalized additive model where $p \gg n$. An approach for high-dimensional
additive modeling is described and analyzed in [26]. We use an approach
which penalizes both the sparsity and the roughness. This is particularly
important if a large number of basis functions is used for modeling the
additive components. This is similar to [26], where the smoothness and the
sparsity are controlled in the backfitting step. In addition, our computational
algorithm, which builds upon the idea of a group lasso problem, has rigorous
convergence properties and is thus provably correct for finding the
optimum of a penalized likelihood function. Moreover, we provide oracle results
which establish asymptotic optimality of the procedure.

2. Penalized maximum likelihood for additive models. We consider
high-dimensional additive regression models with a continuous response
$Y \in \mathbb{R}^n$ and $p$ covariates $x^{(1)}, \ldots, x^{(p)} \in \mathbb{R}^n$
connected through the model

$$Y_i = c + \sum_{j=1}^{p} f_j(x_i^{(j)}) + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $c$ is the intercept term, $\varepsilon_i$ are i.i.d. random variables with
mean zero and $f_j : \mathbb{R} \to \mathbb{R}$ are smooth univariate functions.
For identification purposes, we assume that all $f_j$ are centered, that is,

$$\sum_{i=1}^{n} f_j(x_i^{(j)}) = 0$$

for $j = 1, \ldots, p$. We consider the case of fixed design, that is, we treat the
predictors $x^{(1)}, \ldots, x^{(p)}$ as nonrandom.

With some slight abuse of notation we also denote by $f_j$ the $n$-dimensional
vector $(f_j(x_1^{(j)}), \ldots, f_j(x_n^{(j)}))^T$. For a vector $f \in \mathbb{R}^n$,
we define $\|f\|_n^2 = \frac{1}{n} \sum_{i=1}^{n} f_i^2$.

2.1. The sparsity-smoothness penalty. In order to construct an estimator
which encourages sparsity at the function level, penalizing the norms
$\|f_j\|_n$ would be a suitable approach. Some theory for the case where a
truncated orthogonal basis with $O(n^{1/5})$ basis functions for each component
$f_j$ is used has been developed in [26].

If we use a large number of basis functions, which is necessary to be
able to capture some functions at high complexity, the resulting estimator
will produce function estimates which are too wiggly if the underlying true
functions are very smooth. Hence, we need some additional control or
restrictions of the smoothness of the estimated functions. In order to get
sparse and sufficiently smooth function estimates, we propose the
sparsity-smoothness penalty

$$J(f_j) = \lambda_1 \sqrt{\|f_j\|_n^2 + \lambda_2 I^2(f_j)},$$

where

$$I^2(f_j) = \int (f_j''(x))^2 \, dx$$

measures the smoothness of $f_j$. The two tuning parameters
$\lambda_1, \lambda_2 \ge 0$ control the amount of penalization.

Our estimator is given by the following penalized least squares problem:

(1)  $$\hat f_1, \ldots, \hat f_p = \arg\min_{f_1, \ldots, f_p \in \mathcal{F}} \Big\| Y - \sum_{j=1}^{p} f_j \Big\|_n^2 + \sum_{j=1}^{p} J(f_j),$$

where $\mathcal{F}$ is a suitable class of functions and $Y = (Y_1, \ldots, Y_n)^T$
is the vector of responses. We assume the same level of regularity for each
function $f_j$. If $Y$ is centered, we can omit an unpenalized intercept term and
the nature of the objective function in (1) automatically forces the function
estimates $\hat f_1, \ldots, \hat f_p$ to be centered.

Proposition 1. Let $a, b \in \mathbb{R}$ such that $a < \min_{i,j}\{x_i^{(j)}\}$
and $b > \max_{i,j}\{x_i^{(j)}\}$. Let $\mathcal{F}$ be the space of functions that
are twice continuously differentiable on $[a, b]$ and assume that there exist
minimizers $\hat f_j \in \mathcal{F}$ of (1). Then the $\hat f_j$'s are natural
cubic splines with knots at $x_i^{(j)}$, $i = 1, \ldots, n$.

A proof is given in Appendix A. Hence, we can restrict ourselves to the 
finite-dimensional space of natural cubic splines instead of considering the 
infinite-dimensional space of twice continuously differentiable functions. 

In the following subsection, we illustrate the existence and the computa- 
tion of the estimator. 



2.2. Computational algorithm. For each function $f_j$, we use a cubic
B-spline parameterization with a reasonable number of knots or basis
functions. A typical choice would be to use $K - 4 \asymp \sqrt{n}$ interior
knots that are placed at the empirical quantiles of $x^{(j)}$. Hence, we
parameterize

$$f_j(x) = \sum_{k=1}^{K} \beta_{j,k} b_{j,k}(x),$$

where $b_{j,k} : \mathbb{R} \to \mathbb{R}$ are the B-spline basis functions and
$\beta_j = (\beta_{j,1}, \ldots, \beta_{j,K})^T \in \mathbb{R}^K$ is the parameter
vector corresponding to $f_j$. Based on the basis functions, we can construct an
$n \times pK$ design matrix $B = [B_1 | B_2 | \cdots | B_p]$, where $B_j$ is the
$n \times K$ design matrix of the B-spline basis of the $j$th predictor, that
is, $B_{j,il} = b_{j,l}(x_i^{(j)})$.
For twice continuously differentiable functions, the optimization problem
(1) can now be reformulated as

(2)  $$\hat\beta = \arg\min_{\beta = (\beta_1, \ldots, \beta_p)} \|Y - B\beta\|_n^2 + \lambda_1 \sum_{j=1}^{p} \sqrt{\frac{1}{n} \beta_j^T B_j^T B_j \beta_j + \lambda_2 \beta_j^T \Omega_j \beta_j},$$

where the $K \times K$ matrix $\Omega_j$ contains the inner products of the second
derivatives of the B-spline basis functions, that is,

$$\Omega_{j,kl} = \int b_{j,k}''(x) \, b_{j,l}''(x) \, dx$$

for $k, l \in \{1, \ldots, K\}$.

Hence, (2) can be rewritten as a general group lasso problem [32]

(3)  $$\hat\beta = \arg\min_{\beta = (\beta_1, \ldots, \beta_p)} \|Y - B\beta\|_n^2 + \lambda_1 \sum_{j=1}^{p} \sqrt{\beta_j^T M_j \beta_j},$$

where $M_j = \frac{1}{n} B_j^T B_j + \lambda_2 \Omega_j$. By decomposing (e.g., using
the Cholesky decomposition) $M_j = R_j^T R_j$ for some square $K \times K$ matrix
$R_j$ and by defining $\tilde\beta_j = R_j \beta_j$, $\tilde B_j = B_j R_j^{-1}$,
(3) reduces to

(4)  $$\hat{\tilde\beta} = \arg\min_{\tilde\beta = (\tilde\beta_1, \ldots, \tilde\beta_p)} \|Y - \tilde B \tilde\beta\|_n^2 + \lambda_1 \sum_{j=1}^{p} \|\tilde\beta_j\|,$$

where $\|\tilde\beta_j\| = \sqrt{K} \|\tilde\beta_j\|_K$ is the Euclidean norm in
$\mathbb{R}^K$. This is an ordinary group lasso problem for any fixed
$\lambda_2$, and hence the existence of a solution is guaranteed. For
$\lambda_1$ large enough, some of the coefficient groups
$\tilde\beta_j \in \mathbb{R}^K$ will be estimated to be exactly zero. Hence, the
corresponding function estimate will be zero. Moreover, there exists a value
$\lambda_{1,\max} < \infty$ such that
$\hat{\tilde\beta}_1 = \cdots = \hat{\tilde\beta}_p = 0$ for
$\lambda_1 \ge \lambda_{1,\max}$. This is especially useful to construct a grid
of $\lambda_1$ candidate values for cross-validation (usually on the log-scale).
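
The reparameterization from (3) to (4) can be checked numerically. The following is a minimal sketch for a single component $j$, assuming a random stand-in for the basis matrix $B_j$ and a second-difference matrix as a stand-in for $\Omega_j$ (both hypothetical, not the B-spline quantities of the paper). It verifies that the penalty and the fitted values are unchanged by the transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 6
lam1, lam2 = 0.3, 0.1

# Hypothetical stand-ins for one component j: a random n x K basis matrix
# and a positive semidefinite roughness matrix built from second differences.
B_j = rng.normal(size=(n, K))
D = np.diff(np.eye(K), n=2, axis=0)          # (K - 2) x K difference operator
Omega_j = D.T @ D

# M_j = (1/n) B_j^T B_j + lam2 * Omega_j, decomposed as M_j = R_j^T R_j.
M_j = B_j.T @ B_j / n + lam2 * Omega_j
R_j = np.linalg.cholesky(M_j).T              # numpy returns lower L, M = L L^T

beta_j = rng.normal(size=K)
beta_tilde_j = R_j @ beta_j                  # transformed coefficients
B_tilde_j = B_j @ np.linalg.inv(R_j)         # transformed design block

penalty_original = lam1 * np.sqrt(beta_j @ M_j @ beta_j)
penalty_transformed = lam1 * np.linalg.norm(beta_tilde_j)
fit_original = B_j @ beta_j
fit_transformed = B_tilde_j @ beta_tilde_j
```

In this sketch the two penalties and the two fitted vectors agree up to floating-point error, which is what allows a standard group lasso solver to be applied to the transformed problem.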

Regarding the uniqueness of the identified components, we have equivalent
results as for the lasso. Define
$S(\tilde\beta; \tilde B) = \|Y - \tilde B \tilde\beta\|_n^2$. Similar to [25], we
have the following proposition.

Proposition 2. If $pK \le n$, and if $\tilde B$ has full rank, a unique solution
of (4) exists. If $pK > n$, there exists a convex set of solutions of (4).
Moreover, if $\|\nabla_{\tilde\beta_j} S(\hat{\tilde\beta}; \tilde B)\| < \lambda_1$,
then $\hat{\tilde\beta}_j = 0$ and all other solutions
$\hat{\tilde\beta}_{\mathrm{other}}$ satisfy $\hat{\tilde\beta}_{\mathrm{other},j} = 0$.

A proof can be found in Appendix A.

By rewriting the original problem (1) in the form of (4), we can make
use of already existing algorithms [16, 21, 32] to compute the estimator.
Coordinate-wise approaches as in [21, 32] are efficient and have rigorous
convergence properties. Thus, we are able to compute the estimator exactly,
even if $p$ is very large.

Fig. 1. True functions $f_j$ (solid) and estimated functions $\hat f_j$ (dashed)
for the first 6 components of a simulation run of Example 1 in Section 3. Small
vertical bars indicate original data and grey vertical lines knot positions.
The dotted lines are the function estimates when no smoothness penalty is
used, that is, when setting $\lambda_2 = 0$.
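
To make the structure of problem (4) concrete, here is a minimal sketch of a group lasso solver. It uses plain proximal gradient descent with group-wise soft thresholding, which is a simple stand-in and not the coordinate-wise algorithms of [21, 32]. The function names `group_lasso` and `lambda1_max`, the random design and all numerical settings are assumptions made for illustration only.

```python
import numpy as np

def group_lasso(B, Y, groups, lam1, n_iter=2000):
    """Minimize (1/n)||Y - B b||^2 + lam1 * sum_j ||b_j||_2 by proximal gradient."""
    n = B.shape[0]
    step = n / (2.0 * np.linalg.norm(B, 2) ** 2)   # 1 / Lipschitz constant
    beta = np.zeros(B.shape[1])
    for _ in range(n_iter):
        grad = -2.0 / n * B.T @ (Y - B @ beta)
        z = beta - step * grad
        for idx in groups:
            # Group soft thresholding: shrink the whole block towards zero.
            norm = np.linalg.norm(z[idx])
            shrink = max(0.0, 1.0 - step * lam1 / norm) if norm > 0 else 0.0
            beta[idx] = shrink * z[idx]
    return beta

def lambda1_max(B, Y, groups):
    """Smallest lam1 for which the all-zero vector solves the problem."""
    n = B.shape[0]
    return max(np.linalg.norm(2.0 / n * B[:, idx].T @ Y) for idx in groups)

# Hypothetical toy data: p groups of size K, only the first group is active.
rng = np.random.default_rng(1)
n, p, K = 80, 10, 4
B = rng.normal(size=(n, p * K))
groups = [np.arange(j * K, (j + 1) * K) for j in range(p)]
beta_true = np.zeros(p * K)
beta_true[:K] = 2.0
Y = B @ beta_true + 0.1 * rng.normal(size=n)

lam_max = lambda1_max(B, Y, groups)
beta_zero = group_lasso(B, Y, groups, lam1=1.01 * lam_max)
beta_half = group_lasso(B, Y, groups, lam1=0.5 * lam_max)
active = [j for j, idx in enumerate(groups)
          if np.linalg.norm(beta_half[idx]) > 0]
```

Above `lam_max` every group is set exactly to zero, while smaller values of `lam1` activate groups one block at a time. A log-scale grid of candidate values can therefore start at `lam_max` and decrease from there.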

An example of estimated functions, from simulated data according to
Example 1 in Section 3, is shown in Figure 1. For illustration, we
have also plotted the estimator which involves no smoothness penalty
($\lambda_2 = 0$). The latter clearly shows that for this example, the function
estimates are "too wiggly" compared to the true functions. As we will also see
later, the smoothness penalty plays a key role for the theory.

Remark 1. Alternatives to our penalty would be to use either
(i) $J(f_j) = \lambda_1 \|f_j\|_n + \lambda_2 I(f_j)$ or
(ii) $J(f_j) = \lambda_1 \|f_j\|_n + \lambda_2 I^2(f_j)$. Both
approaches lead to a sparse estimator. While proposal (i) also enjoys nice
theoretical properties (see also Section 5.2), it is computationally more
demanding, because it leads to a second order cone programming problem.
Proposal (ii) basically leads again to a group lasso problem, but appears
to have theoretical drawbacks, that is, the term $\lambda_2 I^2(f_j)$ is really
needed within the square root.

2.3. Oracle results. We now present an oracle inequality for the penalized
estimator. The proofs can be found in Appendix A.

For the theoretical analysis, we introduce an additional penalty parameter
$\lambda_3 \ge 0$ for technical reasons. We consider, here, a penalty of the form

$$J(f_j) = \lambda_1 \sqrt{\|f_j\|_n^2 + \lambda_2 I^2(f_j)} + \lambda_3 I^2(f_j).$$

This penalty involves three smoothing parameters $\lambda_1$, $\lambda_2$ and
$\lambda_3$. One may reduce this to a single smoothing parameter by choosing

$$\lambda_2 = \lambda_3 = \lambda_1^2$$

(see Theorem 1 below). In the simulations, however, the choice $\lambda_3 = 0$
turned out to provide slightly better results than the choice
$\lambda_2 = \lambda_3$. With $\lambda_3 = 0$, the theory goes through provided
the smoothness $I(\hat f_j)$ remains bounded in an appropriate sense.

We let $f^0$ denote the "true" regression function (which is not necessarily
additive), that is, we suppose the regression model

$$Y_i = f^0(x_i) + \varepsilon_i,$$

where $x_i = (x_i^{(1)}, \ldots, x_i^{(p)})^T$ for $i = 1, \ldots, n$, and where
$\varepsilon_1, \ldots, \varepsilon_n$ are independent random errors with
$E[\varepsilon_i] = 0$. Let $f^*$ be a (sparse) additive approximation of $f^0$
of the form

$$f^*(x_i) = c^* + \sum_{j=1}^{p} f_j^*(x_i^{(j)}),$$

where we take $c^* = E[\bar Y]$, $\bar Y = \sum_{i=1}^{n} Y_i / n$. The result of
this subsection (Theorem 1) holds for any such $f^*$ satisfying the
compatibility condition below. Thus, one may invoke the optimal additive
predictor among such $f^*$, which we will call the "oracle." For an additive
function $f$, the squared distance $\|f - f^0\|_n^2$ can be decomposed into

$$\|f - f^0\|_n^2 = \|f - f^0_{\mathrm{add}}\|_n^2 + \|f^0_{\mathrm{add}} - f^0\|_n^2,$$

where $f^0_{\mathrm{add}}$ is the projection of $f^0$ on the space of additive
functions. Thus, when $f^0$ is itself not additive, the oracle can be seen as
the best sparse approximation of the projection $f^0_{\mathrm{add}}$ of $f^0$.
The active set is defined as

(5)  $$A_* = \{ j : \|f_j^*\|_n \neq 0 \}.$$

We define, for $j = 1, \ldots, p$,

$$\tau_n^2(f_j) = \|f_j\|_n^2 + \lambda^{2-\gamma} I^2(f_j).$$

Moreover, we let $0 < \eta < 1$ be some fixed value. The constant
$4/(1 - \eta)$ appearing below in the compatibility condition is stated in this
form to facilitate reference later in the proof of Theorem 1.
We will use a compatibility condition, in the spirit of the incoherence
conditions used for proving oracle inequalities for the standard lasso (see,
e.g., [3, 8, 9, 10, 30]). To avoid digressions, we will not attempt to formulate
the most general condition. A discussion can be found in Section 5.1.

Compatibility condition. For some constants $0 < \eta < 1$ and
$0 < \phi_{n,*} \le 1$, and for all $\{f_j\}_{j=1}^{p}$ satisfying

$$\sum_{j=1}^{p} \tau_n(f_j) \le \frac{4}{1 - \eta} \sum_{j \in A_*} \tau_n(f_j),$$

the following inequality is met:

$$\sum_{j \in A_*} \|f_j\|_n^2 \le \Big( \Big\| \sum_{j=1}^{p} f_j \Big\|_n^2 + \lambda^{2-\gamma} \sum_{j \in A_*} I^2(f_j) \Big) \Big/ \phi_{n,*}^2.$$

For practical applications, the compatibility condition cannot be checked
because the set $A_*$ is unknown.

Consider the general case where $I$ is some semi-norm, for example, as in
Section 2.1. For mathematical convenience, we write

(6)  $$f_j = g_j + h_j$$

with $g_j$ and $h_j$ centered and orthogonal functions, that is,

$$\sum_{i=1}^{n} g_{j,i} = \sum_{i=1}^{n} h_{j,i} = 0$$

and

$$\sum_{i=1}^{n} g_{j,i} h_{j,i} = 0,$$

such that $I(h_j) = 0$ and $I(g_j) = I(f_j)$. The functions $h_j$ are assumed to
lie in a $d$-dimensional space. The entropy of
$(\{g_j : I(g_j) = 1\}, \|\cdot\|_n)$ is denoted by $H_j(\cdot)$; see, for
example, [29]. We assume that for all $j$,

(7)  $$H_j(\delta) \le A \delta^{-2(1-\alpha)}, \qquad \delta > 0,$$

where $0 < \alpha < 1$ and $A > 0$ are constants. When
$I^2(f_j) = \int (f_j''(x))^2 \, dx$, the functions $h_j$ are the linear part of
$f_j$, that is, $d = 1$. Moreover, one then has $\alpha = 3/4$ (see, e.g., [29],
Lemma 3.9).

Finally, we assume sub-Gaussian tails for the errors: for some constants
$L$ and $M$,

(8)  $$\max_i E[\exp(\varepsilon_i^2 / L)] \le M.$$
The next lemma presents the behavior of the empirical process. We use
the notation $(\varepsilon, f)_n = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i)$
for the inner product. Define

(9)  $$S = S_1 \cap S_2 \cap S_3,$$

where

$$S_1 = \Big\{ \max_j \sup_{g_j} \frac{2 |(\varepsilon, g_j)_n|}{\|g_j\|_n^{\alpha} I^{1-\alpha}(g_j)} \le \xi_n \Big\},$$

$$S_2 = \Big\{ \max_j \sup_{h_j} \frac{2 |(\varepsilon, h_j)_n|}{\|h_j\|_n} \le \xi_n \Big\}$$

and

$$S_3 = \{ \bar\varepsilon \le \xi_n \}, \qquad \bar\varepsilon = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i.$$

For an appropriate choice of $\xi_n$, the set $S$ has large probability.



Lemma 1. Assume (7) and (8). There exist constants $c$ and $C$ depending
only on $d$, $\alpha$, $A$, $L$ and $M$, such that for

$$\xi_n \ge c \sqrt{\frac{\log p}{n}}$$

one has

$$P(S) \ge 1 - C \exp[-n \xi_n^2 / c^2].$$

For $\alpha \in (0, 1)$, we define its "conjugate"
$\gamma = 2(1 - \alpha)/(2 - \alpha)$. Recall that when
$I^2(f_j) = \int (f_j''(x))^2 \, dx$, one has $\alpha = 3/4$, and hence
$\gamma = 2/5$.

We are now ready to state the oracle result for
$\hat f = \hat c + \sum_{j=1}^{p} \hat f_j$ as defined in (1), with
$\hat c = \bar Y$.

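Theorem 1. Suppose the compatibility condition is met. Take for
$j = 1, \ldots, p$,

$$J(f_j) = \lambda_1 \sqrt{\|f_j\|_n^2 + \lambda_2 I^2(f_j)} + \lambda_3 I^2(f_j)$$

with $\lambda_1 = \lambda^{(2-\gamma)/2}$ and $\lambda_2 = \lambda_3 = \lambda_1^2$,
and with $\xi_n \sqrt{2/\eta} \le \lambda \le 1$. Then on the set $S$ given in
(9), it holds that

$$\|\hat f - f^0_{\mathrm{add}}\|_n^2 + 2(1 - \eta) \lambda^{(2-\gamma)/2} \sum_{j=1}^{p} \tau_n(\hat f_j - f_j^*) + \lambda^{2-\gamma} \sum_{j=1}^{p} I^2(\hat f_j)$$

$$\le 3 \|f^* - f^0_{\mathrm{add}}\|_n^2 + 3 \lambda^{2-\gamma} \sum_{j \in A_*} I^2(f_j^*) + \cdots$$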
The result of Theorem 1 does not depend on the number of knots (basis
functions) which are used to build the functions $\hat f_j$, as long as
$\hat f_j$ and $f_j^*$ use the same basis functions.

We would like to point out that the theory of Theorem 1 goes through with
only two tuning parameters $\lambda_1$ and $\lambda_2$, but with the additional
restriction that $I(\hat f_j)$ is appropriately bounded.

We also remark that we did not attempt to optimize the constants given 
in Theorem 1, but rather looked for a simple explicit bound. 

Remark 2. Assume that $\phi_{n,*}$ is bounded away from zero. For example,
this holds with large probability for a realization of a design with
independent components (see Section 5.1). In view of Lemma 1, one may take
(under the conditions of this lemma) the smoothing parameter $\lambda$ of order
$\sqrt{\log p / n}$. For $I^2(f_j) = \int (f_j''(x))^2 \, dx$, $\gamma = 2/5$ and
this gives $\lambda^{2-\gamma}$ of order $(\log p / n)^{4/5}$, which is, up to the
log-term, the usual rate for estimating a twice differentiable function. If the
oracle $f^*$ has bounded smoothness $I(f_j^*)$ for all $j$, Theorem 1 yields the
convergence rate $p_{\mathrm{act}} (\log p / n)^{4/5}$, with
$p_{\mathrm{act}} = |A_*|$ being the number of active variables the oracle
needs. This is again, up to the log-term, the same rate one would obtain if it
was known beforehand which of the $p$ functions are relevant. For general
$\phi_{n,*}$, we have the convergence rate
$p_{\mathrm{act}} \phi_{n,*}^{-2} (\log p / n)^{4/5}$.
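
The arithmetic behind the exponent $4/5$ in Remark 2 can be written out as a short worked derivation. It uses only the definition of the conjugate $\gamma$ and the order of $\lambda$ stated above; nothing beyond those two ingredients is assumed.

```latex
\begin{align*}
\alpha = \tfrac{3}{4}
  &\;\Longrightarrow\;
  \gamma = \frac{2(1-\alpha)}{2-\alpha}
         = \frac{2 \cdot \tfrac{1}{4}}{\tfrac{5}{4}}
         = \frac{2}{5}, \\
2 - \gamma &= \frac{8}{5}, \\
\lambda \asymp \Bigl(\frac{\log p}{n}\Bigr)^{1/2}
  &\;\Longrightarrow\;
  \lambda^{2-\gamma}
  \asymp \Bigl(\frac{\log p}{n}\Bigr)^{\frac{1}{2}\cdot\frac{8}{5}}
  = \Bigl(\frac{\log p}{n}\Bigr)^{4/5}.
\end{align*}
```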

Furthermore, the result implies that with large probability, the estimator
selects a superset of the active functions, provided that the latter have
enough signal (such variable screening results have been established for the
lasso in linear and generalized linear models [24, 30]). More precisely, we
have the following corollary.

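Corollary 1. Let $A_0 = \{ j : \|f^0_{\mathrm{add},j}\|_n \neq 0 \}$ be the
active set of $f^0_{\mathrm{add}}$. Assume the compatibility condition holds for
$A_0$, with constant $\phi_{n,0}$. Suppose also that for $j \in A_0$, the
smoothness is bounded, say $I(f^0_{\mathrm{add},j}) \le 1$. Choosing
$f^* = f^0_{\mathrm{add}}$ in Theorem 1 tells us that on $S$,

$$\sum_{j=1}^{p} \|\hat f_j - f^0_{\mathrm{add},j}\|_n \le C \lambda^{(2-\gamma)/2} |A_0| / \phi_{n,0}^2 + 2 \xi_n$$

for some constant $C$. Hence, if

$$\|f^0_{\mathrm{add},j}\|_n > C \lambda^{(2-\gamma)/2} |A_0| / \phi_{n,0}^2 + 2 \xi_n, \qquad j \in A_0,$$

we have (on $S$) that the estimated active set
$\{ j : \|\hat f_j\|_n \neq 0 \}$ contains $A_0$.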
2.4. Comparison with related results. After an earlier version of this
paper, similar results have been published in [17]. Here, we point out some
differences and similarities between our work and [17].

In [17], the framework of reproducing kernel Hilbert spaces (RKHS) is
considered, as, for example, used in COSSO [19], while we use penalties based
on smoothness seminorms. Hence, the two frameworks are rather different,
at least from a mathematical point of view. The results in [17] are valid for
a large class of loss functions, although we would like to point out that the
quadratic loss as studied here is not covered in [17] since they assume that
the loss function is appropriately bounded.

The oracle result and the conditions in [17] are similar to our Theorem
1. Regarding the convergence rate (see Remark 2), the rates obtained in
[17] are similar in spirit to ours. In [17], the rate is slower than ours if the
"smoothness" $\beta$ is equal to 2. Moreover, "smoothness" in [17] is very much
intertwined with the unknown distribution of the covariables, whereas in our
work "smoothness" is defined, for example, in terms of Sobolev norms.

Compared to the work in [17], and, for example, COSSO [19], we gain
flexibility through the introduction of the additional penalty parameter
$\lambda_2$ for (separately) controlling the smoothness. In addition, we present
an algorithm in Section 2.2 which is efficient and has mathematically
established convergence results.



3. Numerical examples. 



3.1. Simulations. In this section, we investigate the empirical properties
of the proposed estimator. We compare our approach with the boosting
approach of [7], where smoothing splines with low degrees of freedom are
used as base learners; see also [5]. For $p = 1$, boosting with splines is known
to be able to adapt to the smoothness of the underlying true function [7].
Generally, boosting is a very powerful machine learning method and a wide
variety of software implementations are available, for example, the R add-on
package mboost.

We use a training set of $n$ samples to train the different methods. An
independent validation set of size $\lfloor n/2 \rfloor$ is used to select the
prediction optimal tuning parameters $\lambda_1$ and $\lambda_2$. We use grids
(on the log-scale) for both $\lambda_1$ and $\lambda_2$, where the grid for
$\lambda_1$ is of size 100 and the grid for $\lambda_2$ is typically of about
size 15. For boosting, the number of boosting iterations is used as tuning
parameter. The shrinkage factor $\nu$ and the degrees of freedom df of the
boosting procedure are set to their default values $\nu = 0.1$ and df = 4; see
also [5].

By SNR, we denote the signal-to-noise ratio, which is defined as

$$\mathrm{SNR} = \frac{\mathrm{Var}(f(X))}{\mathrm{Var}(\varepsilon)},$$

where $f = f^0 : \mathbb{R}^p \to \mathbb{R}$ is the true underlying function.

A total of 100 simulation runs are used for each of the following settings. 

3.1.1. Models. We use the following simulation models. 

Example 1 ($n = 150$, $p = 200$, $p_{\mathrm{act}} = 4$, SNR $\approx 15$). This
example is similar to Example 1 in [26] and [15]. The model is

$$Y_i = f_1(x_i^{(1)}) + f_2(x_i^{(2)}) + f_3(x_i^{(3)}) + f_4(x_i^{(4)}) + \varepsilon_i, \qquad \varepsilon_i \text{ i.i.d. } N(0, 1),$$

with

$$f_1(x) = -\sin(2x), \quad f_2(x) = x^2 - 25/12, \quad f_3(x) = x, \quad f_4(x) = e^{-x} - 2/5 \cdot \sinh(5/2).$$

The covariates are simulated from independent Uniform($-2.5$, $2.5$)
distributions. The true and the estimated functions of a simulation run are
illustrated in Figure 1.
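
Data from Example 1 can be generated with a few lines of code. The sketch below is an illustration, not the authors' simulation code: the function name `simulate_example1` and the seeds are hypothetical, and the signal-to-noise ratio is computed as Var(f(X))/Var(ε), an assumption that is consistent with the stated value SNR ≈ 15.

```python
import numpy as np

def f1(x):
    return -np.sin(2 * x)

def f2(x):
    return x**2 - 25 / 12

def f3(x):
    return x

def f4(x):
    return np.exp(-x) - 2 / 5 * np.sinh(5 / 2)

def simulate_example1(n=150, p=200, seed=0):
    """Draw (X, Y, signal) from the Example 1 model with N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-2.5, 2.5, size=(n, p))
    signal = f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + f4(X[:, 3])
    Y = signal + rng.normal(size=n)
    return X, Y, signal

X, Y, signal = simulate_example1()

# Monte Carlo approximation of Var(f(X)) / Var(eps), with Var(eps) = 1.
_, _, big_signal = simulate_example1(n=200_000, p=4, seed=1)
snr = big_signal.var() / 1.0
```

The constants $25/12$ and $2/5 \cdot \sinh(5/2)$ are the means of $x^2$ and $e^{-x}$ under Uniform($-2.5$, $2.5$), so each of the four components has mean zero, in line with the identification constraint of Section 2.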

Example 2 ($n = 100$, $p = 1000$, $p_{\mathrm{act}} = 4$, SNR $\approx 6.7$). As
above but high dimensional and correlated. The covariates are simulated
according to a multivariate normal distribution with covariance matrix
$\Sigma_{ij} = 0.5^{|i-j|}$; $i, j = 1, \ldots, p$.

Example 3 [$n = 100$, $p = 80$, $p_{\mathrm{act}} = 4$, SNR $\approx 9$ ($t = 0$),
$\approx 7.9$ ($t = 1$)]. This is similar to Example 1 in [19] but with more
predictors. The model is

$$Y_i = 5 f_1(x_i^{(1)}) + 3 f_2(x_i^{(2)}) + 4 f_3(x_i^{(3)}) + 6 f_4(x_i^{(4)}) + \varepsilon_i, \qquad \varepsilon_i \text{ i.i.d. } N(0, 1.74),$$

with

$$f_1(x) = x, \quad f_2(x) = (2x - 1)^2, \quad f_3(x) = \frac{\sin(2\pi x)}{2 - \sin(2\pi x)}$$

and

$$f_4(x) = 0.1 \sin(2\pi x) + 0.2 \cos(2\pi x) + 0.3 \sin^2(2\pi x) + 0.4 \cos^3(2\pi x) + 0.5 \sin^3(2\pi x).$$

The covariates $x = (x^{(1)}, \ldots, x^{(p)})^T$ are simulated according to

$$x^{(j)} = \frac{W^{(j)} + tU}{1 + t}, \qquad j = 1, \ldots, p,$$

where $W^{(1)}, \ldots, W^{(p)}$ and $U$ are i.i.d. Uniform(0, 1). For $t = 0$
this is the independent uniform case. The case $t = 1$ results in a design with
correlation 0.5 between all covariates.
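
The covariate construction of Example 3 can be sketched as follows. The helper name `simulate_covariates` and the sample sizes are assumptions for illustration. The check relies on the fact that the pairwise correlation of this design is $t^2/(1 + t^2)$, which equals 0.5 for $t = 1$.

```python
import numpy as np

def simulate_covariates(n, p, t, seed=0):
    """Draw x^(j) = (W^(j) + t * U) / (1 + t) with W^(j), U i.i.d. Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(n, p))
    U = rng.uniform(size=(n, 1))      # one shared U per observation
    return (W + t * U) / (1 + t)

X0 = simulate_covariates(n=50_000, p=5, t=0)   # independent uniform case
X1 = simulate_covariates(n=50_000, p=5, t=1)   # correlated case

# Empirical pairwise correlations (upper triangle, without the diagonal).
upper = np.triu_indices(5, k=1)
off0 = np.corrcoef(X0, rowvar=False)[upper]
off1 = np.corrcoef(X1, rowvar=False)[upper]
```

Dividing by $1 + t$ keeps every covariate inside the unit interval, so the component functions are evaluated on the same range for both values of $t$.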
Fig. 2. True functions $f_j$ (solid) and estimated functions $\hat f_j$ (dashed)
for the first 6 components of a simulation run of Example 3 ($t = 0$). Small
vertical bars indicate original data and grey vertical lines knot positions.
The dotted lines are the function estimates when no smoothness penalty is
used, that is, when setting $\lambda_2 = 0$.



The true functions and the first 6 estimated functions of a simulation run
with $t = 0$ are illustrated in Figure 2.

Moreover, we also consider a "high-frequency" situation where we use
$f_3(8x)$ and $f_4(4x)$ instead of $f_3(x)$ and $f_4(x)$. The corresponding
signal-to-noise ratios for these models are SNR $\approx 9$ for $t = 0$ and
SNR $\approx 8.1$ for $t = 1$.

Example 4 [$n = 100$, $p = 60$, $p_{\mathrm{act}} = 12$, SNR $\approx 9$ ($t = 0$),
$\approx 11.25$ ($t = 1$)]. This is similar to Example 2 in [19] but with fewer
observations. We use the same functions as in Example 3. The model is

$$Y_i = f_1(x_i^{(1)}) + f_2(x_i^{(2)}) + f_3(x_i^{(3)}) + f_4(x_i^{(4)})$$
$$\quad + 1.5 f_1(x_i^{(5)}) + 1.5 f_2(x_i^{(6)}) + 1.5 f_3(x_i^{(7)}) + 1.5 f_4(x_i^{(8)})$$
$$\quad + 2 f_1(x_i^{(9)}) + 2 f_2(x_i^{(10)}) + 2 f_3(x_i^{(11)}) + 2 f_4(x_i^{(12)}) + \varepsilon_i,$$

with $\varepsilon_i$ i.i.d. $N(0, 0.5184)$. The covariates are simulated as in
Example 3.

3.1.2. Performance measures. In order to compare the prediction
performances, we use the mean squared prediction error

$$\mathrm{PE} = E_X[(\hat f(X) - f(X))^2]$$

as performance measure. The above expectation is approximated by a sample
of 10,000 points from the distribution of $X$. In each simulation run, we
compute the ratio of the prediction performance of the two methods. Finally,
we take the mean of the ratios over all simulation runs.

Table 1

Results of the different simulation models. Reported is the mean of the ratio
of the prediction error of the two methods. SSP: sparsity-smoothness penalty
approach, boost: boosting with smoothing splines. Standard deviations are
given in parentheses

Model                              PE_SSP/PE_boost

Example 1                          0.93 (0.13)
Example 2                          0.96 (0.10)
Example 3 (t = 0)                  0.81 (0.13)
Example 3 (t = 1)                  0.90 (0.19)
Example 3 "high-freq" (t = 0)      0.65 (0.11)
Example 3 "high-freq" (t = 1)      0.57 (0.10)
Example 4 (t = 0)                  0.89 (0.10)
Example 4 (t = 1)                  0.88 (0.13)

For variable selection properties, we use the number of true positives 
(TP) and false positives (FP) at each simulation run. We report the average 
number over all runs to compare the different methods. 

3.1.3. Results. The results are summarized in Tables 1 and 2. The
sparsity-smoothness penalty approach (SSP) has smaller prediction error than
boosting, especially for the "high-frequency" situations. Because the weak
learners of the boosting method only use 4 degrees of freedom, boosting tends
to neglect or underestimate those components with higher oscillation. This can
also be observed with respect to the number of true positives. By relaxing
the smoothness penalty (i.e., choosing $\lambda_2$ small or setting
$\lambda_2 = 0$), SSP is able to handle the high-frequency situations, at the
cost of too wiggly function estimates for the remaining components. Using a
different amount of regularization for sparsity and smoothness, SSP can work
with a large number of basis functions in order to be flexible enough to
capture sophisticated functional relationships and, on the other hand, to
produce smooth estimates if the underlying functions are smooth.

With the exception of the high-frequency examples, the number of true 
positives (TP) is very similar for both methods. There is no clear trend with 
respect to the number of false positives (FP). 

3.2. Real data. In this section, we would like to compare the different 
estimators on real data sets. 
3.2.1. Tecator. The meatspec data set contains data from the Tecator
Infratec Food and Feed Analyzer. It is, for example, available in the R add- 
on package faraway and on StatLib. The p = 100 predictors are channel 
spectrum measurements, and are therefore highly correlated. A total of n = 
215 observations are available. 

The data is split into a training set of size 100 and a validation set of 
size 50. The remaining data are used as test set. On the training dataset, 
the first 30 principal components are calculated, scaled to unit variance and 
used as covariates in additive modeling. Moreover, the validation and test 
data sets are transformed to correspond to the principal components of the 
training data set. We fit an additive model to predict the logarithm of the fat 
content. This is repeated 50 times. For each split into training and test data, 
we compute the ratio of the prediction errors from the SSP and boosting 
method on the test data, as in Section 3.1.2. The mean of the ratio over the 
50 splits is 0.86, the corresponding standard deviation is 0.46. This indicates 
superiority of our sparsity-smoothness penalty approach. 

3.2.2. Motif regression. In motif regression problems [11], the aim is 
to predict gene expression levels or binding intensities based on informa- 
tion on the DNA sequence. For our specific dataset, from the Ricci lab at 
ETH Zurich, we have binding intensities Yi of a certain transcription factor 
(TF) at 287 regions on the DNA. Moreover, for each region $i$, motif scores
$x_i^{(1)}, \ldots, x_i^{(p)}$, $p = 196$, are available. A motif is a candidate
for the binding site of the TF on the DNA, typically a 5-15bp long DNA
sequence. The score $x_i^{(j)}$ measures how well the $j$th motif is
represented in the $i$th region. The
candidate list of motifs and their corresponding scores were created with a 
variant of the MDScan algorithm [20]. The main goal here is to find the 
relevant covariates. 



Table 2

Average values of the number of true (TP) and false (FP) positives. Standard
deviations are given in parentheses

Model                            TP_SSP         FP_SSP          TP_boost       FP_boost

Example 1                        4.00 (0.00)    24.30 (14.11)   4.00 (0.00)    22.18 (12.75)
Example 2                        3.47 (0.61)    34.37 (17.38)   3.60 (0.64)    28.76 (20.15)
Example 3 (t = 0)                4.00 (0.00)    20.20 (9.30)    4.00 (0.00)    21.61 (10.90)
Example 3 (t = 1)                3.93 (0.29)    19.28 (9.61)    3.92 (0.27)    18.65 (8.35)
Example 3 "high-freq" (t = 0)    2.80 (0.78)    12.26 (7.61)    2.16 (0.94)    9.23 (9.74)
Example 3 "high-freq" (t = 1)    2.46 (0.85)    11.17 (8.50)    1.59 (1.27)    13.24 (13.89)
Example 4 (t = 0)                11.69 (0.56)   21.23 (6.85)    11.68 (0.57)   25.91 (9.43)
Example 4 (t = 1)                10.64 (1.15)   19.78 (7.51)    10.67 (1.25)   23.76 (9.89)
Fig. 3. Estimated functions $\hat f_j$ of the two most stable motifs (left panel:
Motif.P1.6.23, right panel: Motif.P1.6.26). Small vertical bars indicate
original data.



We used 5-fold cross-validation to determine the prediction optimal tuning
parameters, yielding 28 active functions. To assess the stability of the
estimated model, we performed a nonparametric bootstrap analysis. At each
of the 100 bootstrap samples, we fit the model with the fixed optimal tuning
parameters from above. The two functions which appear most often in the
bootstrapped model estimates are depicted in Figure 3. While the left-hand
side plot shows an approximately linear relationship, the effect of the other
motif seems to diminish for larger values. Indeed, Motif.P1.6.26 is the
true (known) binding site. A follow-up experiment showed that the TF does
not directly bind to Motif.P1.6.23. Hence, this motif is a candidate for
a binding site of a co-factor (another TF) and needs further experimental
validation.



4. Extensions. 



4.1. Generalized additive models. Conceptually, we can also apply the
sparsity-smoothness penalty from Section 2 to generalized linear models
(GLM) by replacing the residual sum of squares
$\|Y - \sum_{j=1}^{p} f_j\|_n^2$ by the corresponding negative log-likelihood
function. We illustrate the method for logistic regression where
$Y \in \{0, 1\}^n$. The negative log-likelihood as a function of the linear
predictor $\eta$ and the response vector $Y$ is

$$\ell(\eta, Y) = -\frac{1}{n} \sum_{i=1}^{n} [Y_i \eta_i - \log\{1 + \exp(\eta_i)\}],$$
where $\eta_i = c + \sum_{j=1}^{p} f_j(x_i^{(j)})$. The estimator is defined as

(10)  $$\hat c, \hat f_1, \ldots, \hat f_p = \arg\min_{c \in \mathbb{R},\, f_1, \ldots, f_p \in \mathcal{F}} \ell\Big( c + \sum_{j=1}^{p} f_j, Y \Big) + \sum_{j=1}^{p} J(f_j).$$

Table 3

Results of different model sizes p. Reported is the mean of the ratio of the
prediction error of the two methods. SSP: sparsity-smoothness penalty
approach, boost: boosting with smoothing splines. Standard deviations are
given in parentheses

p        PE_SSP/PE_boost

250      0.93 (0.06)
500      0.96 (0.07)
1000     0.98 (0.05)

This has a similar form as (1) with the exception that we have to explicitly 
include a (nonpenalized) intercept term c. Using the same arguments as in 
Section 2, leads to the fact that for twice continuously differentiable func- 
tions, the solution can be represented as a natural cubic spline and that (10) 
leads again to a group lasso problem. This can, for example, be minimized 
with the algorithm of [21]. We illustrate the performance of the estimator 
in a small simulation study. 
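
The logistic negative log-likelihood of Section 4.1 is easy to evaluate in a numerically stable way. The sketch below is an illustration only: the function names are hypothetical, and the gradient with respect to the linear predictor is included because gradient-based group lasso solvers typically need it.

```python
import numpy as np

def logistic_nll(eta, y):
    """-(1/n) * sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]."""
    eta = np.asarray(eta, dtype=float)
    y = np.asarray(y, dtype=float)
    # log(1 + exp(eta)) is computed as logaddexp(0, eta) to avoid overflow.
    return -np.mean(y * eta - np.logaddexp(0.0, eta))

def logistic_nll_gradient(eta, y):
    """Gradient of logistic_nll with respect to eta: (sigmoid(eta) - y) / n."""
    eta = np.asarray(eta, dtype=float)
    y = np.asarray(y, dtype=float)
    prob = 0.5 * (1.0 + np.tanh(0.5 * eta))   # overflow-safe sigmoid
    return (prob - y) / eta.size

eta = np.array([-2.0, 0.0, 3.0])
y = np.array([0.0, 1.0, 1.0])
value = logistic_nll(eta, y)
grad = logistic_nll_gradient(eta, y)
```

For a linear predictor that is identically zero, every observation contributes $\log 2$, which gives a quick sanity check for an implementation.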



4.1.1. Small simulation study. Denote by $f: \mathbb{R}^p \to \mathbb{R}$ the true function
of Example 2 in Section 3. We simulate the linear predictor $\eta$ as

\[ \eta(X) = 1.5 \cdot (2 + f(X)), \]

where $X \in \mathbb{R}^p$ has the same distribution as in Example 2. The binary
response $Y$ is then generated according to a Bernoulli distribution with
probability $1/(1 + \exp(-\eta(X)))$, which results in a Bayes risk of approximately
0.17. The sample size $n$ is set to 100. The results for various model sizes $p$
are reported in Tables 3 and 4. The performance of the two methods is quite
similar. SSP has a slightly lower prediction error. Regarding model selection
properties, SSP has fewer false positives at the cost of slightly fewer true
positives.
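The data-generating step of this study can be sketched as follows. The sketch reads the display as $\eta(X) = 1.5\cdot(2 + f(X))$; since the function $f$ and the covariate distribution of Example 2 are not reproduced in this section, `true_function` and the uniform covariates below are hypothetical placeholders, so the output does not reproduce the reported Bayes risk.

```python
import math
import random


def true_function(x):
    # Hypothetical stand-in for the f of Example 2 (not given in this section).
    return -math.sin(2.0 * x[0]) + x[1] ** 2


def simulate_binary_response(n, p, rng):
    """Draw (x, y, prob) triples with y ~ Bernoulli(1 / (1 + exp(-eta(x))))."""
    data = []
    for _ in range(n):
        # Placeholder covariate law: independent Uniform(-1, 1) entries.
        x = [rng.uniform(-1.0, 1.0) for _ in range(p)]
        eta = 1.5 * (2.0 + true_function(x))
        prob = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < prob else 0
        data.append((x, y, prob))
    return data


sample = simulate_binary_response(n=100, p=250, rng=random.Random(0))
```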

4.2. Adaptivity. Similar to the adaptive lasso [34], we can also use
different penalties for the different components, that is, use a penalty of the
form

\[ J(f_j) = \lambda_1 \sqrt{w_{1,j} \|f_j\|_n^2 + \lambda_2 w_{2,j} I^2(f_j)}, \]

where the weights $w_{1,j}$ and $w_{2,j}$ are ideally chosen in a data-adaptive way.
If an initial estimator $\hat f_{j,\mathrm{init}}$ is available, a choice would be to use

\[ w_{1,j} = \frac{1}{\|\hat f_{j,\mathrm{init}}\|_n^{\gamma}}, \qquad w_{2,j} = \frac{1}{I(\hat f_{j,\mathrm{init}})^{\gamma}}, \]

for some $\gamma > 0$. The estimator can then be computed similarly as described
in Section 2.2. This allows for different degrees of smoothness for different
components.

We have applied the adaptive estimator to the simulation models of
Section 3. In each simulation run, we use weights (with $\gamma = 1$) based on the
ordinary sparsity-smoothness estimator. For comparison, we compute the ratio
of the prediction error of the adaptive and the ordinary sparsity-smoothness
estimator at each simulation run. The results are summarized in Table 5.
Both the prediction error and the number of false positives can be decreased
by a good margin in all examples. The number of true positives gets slightly
decreased in some examples.
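The weight construction can be sketched in a few lines. The sketch assumes the adaptive penalty has the form $\lambda_1\sqrt{w_{1,j}\|f_j\|_n^2 + \lambda_2 w_{2,j} I^2(f_j)}$ with weights equal to inverse powers of the initial norms; the function names and the convention of assigning an infinite weight to a component whose initial estimate is exactly zero are assumptions of this example.

```python
import math


def adaptive_weights(init_norms, init_roughness, gamma=1.0):
    """Return (w1, w2) with w1_j = 1/||f_j,init||_n^gamma, w2_j = 1/I(f_j,init)^gamma."""

    def inverse_power(value):
        # Assumed convention: a quantity estimated as exactly zero receives an
        # infinite weight, which removes it from the second-stage fit.
        return math.inf if value == 0.0 else 1.0 / value ** gamma

    w1 = [inverse_power(v) for v in init_norms]
    w2 = [inverse_power(v) for v in init_roughness]
    return w1, w2


def adaptive_penalty(norm_sq, rough_sq, w1, w2, lam1, lam2):
    """Evaluate lam1 * sqrt(w1 * ||f_j||_n^2 + lam2 * w2 * I^2(f_j))."""
    return lam1 * math.sqrt(w1 * norm_sq + lam2 * w2 * rough_sq)
```

Components with a small initial norm are penalized more heavily, which is the mechanism behind the reduction in false positives reported in Table 5.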

5. Mathematical theory. 

Table 4

Average values of the number of true (TP) and false (FP) positives. Standard deviations
are given in parentheses

p       TP_SSP        FP_SSP          TP_boost      FP_boost

250     2.94 (0.71)   22.81 (10.56)   3.09 (0.78)   29.67 (14.91)

500     2.56 (0.82)   24.92 (12.47)   2.80 (0.82)   31.41 (17.28)

1000    2.36 (0.84)   26.45 (14.88)   2.61 (0.71)   33.69 (19.54)



Table 5

Results of the different simulation models. Reported is the mean of the ratio of the
prediction error of the two methods and the average values of the number of true (TP)
and false (FP) positives. SSP, adapt: adaptive sparsity-smoothness penalty approach,
SSP: ordinary sparsity-smoothness penalty approach. Standard deviations are given in
parentheses

Model                            PE_SSP,adapt/PE_SSP   TP             FP

Example 1                        0.47 (0.13)           4.00 (0.00)    4.09 (4.63)

Example 2                        0.63 (0.17)           3.31 (0.71)    6.12 (5.14)

Example 3 (t = 0)                0.53 (0.14)           4.00 (0.00)    4.64 (4.52)

Example 3 (t = 1)                0.63 (0.22)           3.81 (0.46)    5.04 (4.82)

Example 3 "high-freq" (t = 0)    0.87 (0.09)           2.28 (0.78)    2.98 (2.76)

Example 3 "high-freq" (t = 1)    0.91 (0.10)           1.69 (0.73)    2.59 (3.30)

Example 4 (t = 0)                0.77 (0.11)           11.21 (0.84)   8.18 (5.04)

Example 4 (t = 1)                0.88 (0.12)           9.73 (1.29)    7.93 (5.35)
5.1. On the compatibility condition. We show in this subsection that the
compatibility condition holds under reasonable conditions when

\[ I^2(f_j) = \int_0^1 |f_j^{(s)}(x)|^2 \, dx \]

is the Sobolev norm ($f_j^{(s)}$ being the $s$th derivative of $f_j$), and when in
addition, the $X_i = (X_i^{(1)}, \ldots, X_i^{(p)})$ are i.i.d. copies of a $p$-dimensional random
variable $X \in [0, 1]^p$ with distribution $Q$. Then, the compatibility condition
may be replaced by a theoretical variant, where the norm $\|\cdot\|_n$ is replaced
by the theoretical $L_2(Q)$-norm $\|\cdot\|$. The theoretical compatibility
condition (given below) is not about $n$-dimensional vectors, but about functions.
In that sense, the sample size $n$ plays a less prominent role. For example,
the theoretical compatibility condition is satisfied when the components
are independent.

The main assumption to make the replacement by a theoretical version
possible, is the requirement that

\[ \lambda^{1-\gamma} |A_*| \]

[with $\gamma = 2/(2s + 1)$] is small in an appropriate sense [see (11)]. This is
comparable to the condition $\lambda |A_*|$ being small, for the ordinary lasso (see,
e.g., [9]). In fact, our approach for the transition from fixed to random design
may also shed new light on the same transition for the lasso.

Let $X = (X^{(1)}, \ldots, X^{(p)}) \in [0, 1]^p$ have distribution $Q$, and let $X_1, \ldots, X_n$
be i.i.d. copies of $X$. The marginal distribution of $X^{(j)}$ is denoted by $Q_j$.
We write

\[ \|f\|^2 = \int f^2 \, dQ \]

and for a function $f_j$ depending only on the $j$th variable $X^{(j)}$,

\[ \|f_j\|^2 = \int f_j^2 \, dQ_j. \]

In this subsection, we assume all $f_j$'s are centered:

\[ \int f_j \, dQ_j = 0, \qquad j = 1, \ldots, p. \]

Recall the notation

\[ \tau_n^2(f_j) = \|f_j\|_n^2 + \lambda^{2-\gamma} I^2(f_j). \]

We now also define the theoretical counterparts

\[ \tau^2(f_j) = \|f_j\|^2 + \lambda^{2-\gamma} I^2(f_j) \]

and write

\[ \tau_{\mathrm{tot}}(f) = \tau_{\mathrm{in}}(f) + \tau_{\mathrm{out}}(f), \]

\[ \tau_{\mathrm{in}}(f) = \sum_{j \in A_*} \tau(f_j), \qquad \tau_{\mathrm{out}}(f) = \sum_{j \notin A_*} \tau(f_j). \]

One may in fact rework the proofs of the oracle inequality directly
in order to handle random design. This would generally lead to better constants
than the approach that we now take, which is to show that the conditions for
fixed design hold with large probability. The advantage of this detour,
however, is that we do not have to repeat the main body of the proof.

The theoretical compatibility condition is of the same form as the
empirical one, but with different constants.



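Theoretical compatibility condition. For a constant $0 < \eta < 1$
and $0 < \phi_* \le 1$, and for all $f$ satisfying

\[ \tau_{\mathrm{tot}}(f) \le C_\eta \tau_{\mathrm{in}}(f), \]

where

\[ C_\eta = \frac{4(1 + \eta)}{(1 - \eta)^2}, \]

we have

\[ \sum_{j \in A_*} \|f_j\|^2 \le \Bigl( \|f\|^2 + \lambda^{2-\gamma} \sum_{j \in A_*} I^2(f_j) \Bigr) \Big/ \phi_*^2. \]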

Note that the theoretical compatibility condition trivially holds when the 
components of X are independent. However, independence is not a necessary 
condition: much broader schemes are allowed. 

Let $C_0$ be a constant and

\[ S_4 = \Bigl\{ \sup_f \frac{| \|f\|_n^2 - \|f\|^2 |}{\tau_{\mathrm{tot}}^2(f)} \le C_0 \lambda^{1-\gamma} \Bigr\}. \]

In Appendix B, we show that for an appropriate value of $\lambda$, $S_4$ has large
probability, for a constant $C_0$ depending only on $s$, and on an assumed lower
bound for the marginal densities of the $X^{(j)}$. In fact, it turns out that one
can take $\lambda$ of order $\sqrt{\log p / n}$ under weak conditions, assuming $I(\cdot)$ is the
Sobolev norm.



Theorem 2. Assume 
Then on $S_4$, the theoretical compatibility condition implies the empirical one
as given in Section 2.3, with constant

As previously mentioned, condition (11) implies that the number of active
components cannot grow too fast if $|A_*| \lambda^{1-\gamma}$ is to remain small.

We now take a closer look at the theoretical compatibility condition.
The following two conditions are sufficient and might yield some more
insight.



Well-conditioned active set condition. We say that the active 
set ^4* is well conditioned if for some constant < < 1, and for all 

{fj}jeA«i 



E m 

jeA* 



< 



E fi 

jeA* 



> 2 



The inner product in $L_2(Q)$ between functions $f$ and $\tilde f$ is denoted by
$(f, \tilde f)$. No perfect canonical dependence in our setup amounts to the
following condition.

No perfect canonical dependence condition. We say that the
active and nonactive variables have no perfect canonical dependence, if for a
constant $0 \le \rho_* < 1$, and all $\{f_j\}_{j=1}^p$, we have for $f_{\mathrm{in}} = \sum_{j \in A_*} f_j$ and
$f_{\mathrm{out}} = \sum_{j \notin A_*} f_j$, that

\[ \frac{|(f_{\mathrm{in}}, f_{\mathrm{out}})|}{\|f_{\mathrm{in}}\| \, \|f_{\mathrm{out}}\|} \le \rho_*. \]

The next lemma makes the link between the theoretical compatibility
condition and the above two conditions.

Lemma 2. Let $f = f_{\mathrm{in}} + f_{\mathrm{out}}$ satisfy

\[ \frac{|(f_{\mathrm{in}}, f_{\mathrm{out}})|}{\|f_{\mathrm{in}}\| \, \|f_{\mathrm{out}}\|} \le \rho_* < 1. \]

Then

\[ \|f_{\mathrm{in}}\|^2 \le \|f\|^2 / (1 - \rho_*^2). \]

Proof. Clearly,

\[ \|f_{\mathrm{in}}\|^2 \le \|f\|^2 + 2 |(f_{\mathrm{in}}, f_{\mathrm{out}})| - \|f_{\mathrm{out}}\|^2. \]

Hence

\[ \|f_{\mathrm{in}}\|^2 \le \|f\|^2 + 2 \rho_* \|f_{\mathrm{in}}\| \|f_{\mathrm{out}}\| - \|f_{\mathrm{out}}\|^2 \le \|f\|^2 + \rho_*^2 \|f_{\mathrm{in}}\|^2. \qquad \Box \]
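The bound of Lemma 2 is easy to check numerically in a finite-dimensional stand-in for $L_2(Q)$. In the sketch below, vectors play the role of functions and the averaged dot product plays the role of the inner product; $\rho_*$ is taken to be the exact normalized inner product of the two vectors, which is the smallest admissible value and hence the tightest case. All names are illustrative.

```python
import math
import random


def inner(u, v):
    """Averaged dot product, standing in for the L2(Q) inner product."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def lemma2_holds(f_in, f_out):
    """Check ||f_in||^2 <= ||f_in + f_out||^2 / (1 - rho^2) for the exact rho."""
    norm_in = math.sqrt(inner(f_in, f_in))
    norm_out = math.sqrt(inner(f_out, f_out))
    rho = abs(inner(f_in, f_out)) / (norm_in * norm_out)
    if rho >= 1.0:
        return True  # the lemma makes no claim under perfect dependence
    f = [a + b for a, b in zip(f_in, f_out)]
    # Small relative slack guards against floating-point rounding.
    return norm_in ** 2 <= inner(f, f) / (1.0 - rho ** 2) * (1.0 + 1e-9)


rng = random.Random(1)
checks = []
for _ in range(200):
    u = [rng.gauss(0.0, 1.0) for _ in range(20)]
    v = [rng.gauss(0.0, 1.0) for _ in range(20)]
    checks.append(lemma2_holds(u, v))
```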



Corollary 2. A well-conditioned active set in combination with no 
perfect canonical dependence implies the theoretical compatibility condition 



Remark 3. Canonical dependence is about the dependence structure
of variables. To compare, let $X_{\mathrm{in}}$ and $X_{\mathrm{out}}$ be two random variables, with
joint density $q$, and with marginal densities $q_{\mathrm{in}}$ and $q_{\mathrm{out}}$. Define for
real-valued measurable functions $f_{\mathrm{in}}$ and $f_{\mathrm{out}}$, of $X_{\mathrm{in}}$ and $X_{\mathrm{out}}$, respectively,
the squared norms $\|f_{\mathrm{in}}\|^2 = \int f_{\mathrm{in}}^2 q_{\mathrm{in}}$, and $\|f_{\mathrm{out}}\|^2 = \int f_{\mathrm{out}}^2 q_{\mathrm{out}}$, and the inner
product $(f_{\mathrm{in}}, f_{\mathrm{out}}) = \int f_{\mathrm{in}} f_{\mathrm{out}} q$. Assume the functions are centered: $\int f_{\mathrm{in}} q_{\mathrm{in}} =
\int f_{\mathrm{out}} q_{\mathrm{out}} = 0$. Suppose that for some constant $\rho_*$,

\[ \int \frac{(q - q_{\mathrm{in}} q_{\mathrm{out}})^2}{q_{\mathrm{in}} q_{\mathrm{out}}} \le \rho_*^2. \]

Then one can easily verify that $|(f_{\mathrm{in}}, f_{\mathrm{out}})| \le \rho_* \|f_{\mathrm{in}}\| \|f_{\mathrm{out}}\|$. In other words,
the no perfect canonical dependence condition is in this context the
assumption that the density and the product density are, in $\chi^2$-sense, not too far
off.

5.2. On the choice of the penalty. In this paper, we have chosen the
penalty in such a way that it leads to good theoretical behavior (namely the
oracle inequality of Theorem 1), as well as to computationally fast, and in
fact already existing, algorithms. The penalty can be improved theoretically,
at the cost of computational efficiency and simplicity.

Indeed, a main ingredient from the theoretical point of view is that the
randomness of the problem (the behavior of the empirical process) should
be taken care of. Let us recall Lemma 1 which says that the set $S$ has large
probability, and on $S$ all functions $g_j$ satisfy

\[ |2(\varepsilon, g_j)_n| \le \xi_n \|g_j\|_n^{\alpha} I^{1-\alpha}(g_j). \]

Our penalty was based on the inequality (which holds for any $a$ and $b$
positive)

\[ a^{\alpha} b^{1-\alpha} \le \sqrt{a^2 + b^2}. \]

More generally, it holds for any $q \ge 1$ that

\[ a^{\alpha} b^{1-\alpha} \le (a^q + b^q)^{1/q}. \]

In particular, the choice $q = 1$ would be a natural one, and would lead to an
oracle inequality involving $I(f_j^*)$ instead of the square $I^2(f_j^*)$ on the
right-hand side in Theorem 1. The penalty $\lambda^{(2-\gamma)/2} \sum_{j=1}^p \|f_j\|_n + \lambda^{2-\gamma} \sum_{j=1}^p I(f_j)$,
corresponding to $q = 1$, still leads to a convex optimization problem, but one that is
much more involved and hence less efficient to solve; see also Remark 1 in
Section 2.2.

One may also use the inequality

This leads to a "theoretically ideal" penalty of the form $\lambda^{2-\gamma} \sum_{j=1}^p I^{\gamma}(f_j) +
\lambda \sum_{j=1}^p \|h_j\|_n$, where $h_j$ is from (6). It allows one to adapt to small values of
$I(f_j^*)$. But clearly, as this penalty is nonconvex, it may be computationally
cumbersome. On the other hand, iterative approximations might prove to
work well.

6. Conclusions. We present an estimator and algorithm for fitting sparse,
high-dimensional generalized additive models. The estimator is based on a
penalized likelihood. The penalty is new, as it allows for different
regularization of the sparsity and the smoothness of the additive functions. It
is exactly this combination which allows us to derive oracle results for
high-dimensional additive models. We also argue empirically that the inclusion of
a smoothness part in the penalty function yields much better results than
having the sparsity term only. Furthermore, we show that the optimization
of the penalized likelihood can be written as a group lasso problem and
hence, efficient coordinate-wise algorithms can be used which have provable
numerical convergence properties.

We illustrate some empirical results for simulated and real data. Our new 
approach with the sparsity and smoothness penalty is never worse and some- 
times substantially better than L2-boosting for generalized additive model 
fitting [5, 7]. Furthermore, with an adaptive sparsity-smoothness penalty 
method, large additional performance gains are achieved. With the real data 
about motif regression for finding DNA-sequence motifs, one among two se- 
lected "stable" variables is known to be true, that is, it corresponds to a 
known binding site of a transcription factor. 

APPENDIX A: PROOFS 

Proof of Proposition 1. Because of the additive structure of $f$ and
the penalty, it suffices to analyze each component $f_j$, $j = 1, \ldots, p$
independently. Let $\hat f_1, \ldots, \hat f_p$ be a solution of (1) and assume that some or all $\hat f_j$ are
not natural cubic splines with knots at $x_i^{(j)}$, $i = 1, \ldots, n$. By Theorem 2.2 in
[13], we can construct natural cubic splines $\hat g_j$ with knots at $x_i^{(j)}$, $i = 1, \ldots, n$
such that

\[ \hat g_j(x_i^{(j)}) = \hat f_j(x_i^{(j)}) \]

for $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Hence,

\[ \Bigl\| Y - \sum_{j=1}^p \hat g_j \Bigr\|_n^2 = \Bigl\| Y - \sum_{j=1}^p \hat f_j \Bigr\|_n^2 \]

and

\[ \|\hat g_j\|_n = \|\hat f_j\|_n. \]

But by Theorem 2.3 in [13], $I^2(\hat g_j) \le I^2(\hat f_j)$. Therefore, the value in the
objective function (1) can be decreased. Hence, the minimizer of (1) must
lie in the space of natural cubic splines. □

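Proof of Proposition 2. The first part follows because of the strict
convexity of the objective function. Consider now the case $pK > n$. The
(necessary and sufficient) conditions for $\hat\beta$ to be a solution of the group lasso
problem (4) are [32]

\[ \|\nabla_{\beta_j} S(\hat\beta; B)\| = \lambda_1 \quad \text{for } \hat\beta_j \ne 0, \]

\[ \|\nabla_{\beta_j} S(\hat\beta; B)\| \le \lambda_1 \quad \text{for } \hat\beta_j = 0. \]

Assume that there exist two solutions $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ such that, for a
component $j$, we have $\hat\beta_j^{(1)} = 0$ with $\|\nabla_{\beta_j} S(\hat\beta^{(1)}; B)\| < \lambda_1$, but $\hat\beta_j^{(2)} \ne 0$. Because
the set of all solutions is convex,

\[ \hat\beta_\rho = (1 - \rho) \hat\beta^{(1)} + \rho \hat\beta^{(2)} \]

is also a minimizer for all $\rho \in [0, 1]$. By assumption $\hat\beta_{\rho,j} \ne 0$ for $\rho \in (0, 1]$, and hence
$\|\nabla_{\beta_j} S(\hat\beta_\rho; B)\| = \lambda_1$ for all $\rho \in (0, 1)$. Hence, it holds for $g(\rho) = \|\nabla_{\beta_j} S(\hat\beta_\rho; B)\|$
that $g(0) < \lambda_1$ and $g(\rho) = \lambda_1$ for all $\rho \in (0, 1)$. But this is a contradiction
to the fact that $g(\cdot)$ is continuous. Hence, a nonactive (i.e., zero)
component $j$ with $\|\nabla_{\beta_j} S(\hat\beta; B)\| < \lambda_1$ cannot be active (i.e., nonzero) in any other
solution. □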
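The optimality conditions used in this proof can be made concrete in a deliberately simplified setting. The sketch below assumes an identity design and the loss $S(\beta) = \frac{1}{2}\|y - \beta\|^2$, for which the group lasso solution is group-wise soft-thresholding; this is not the paper's problem (4), whose design matrix $B$ and scaling of $S$ are not reproduced here. It only illustrates that the gradient norm equals $\lambda$ on active groups and is at most $\lambda$ on inactive ones.

```python
import math


def group_soft_threshold(y_groups, lam):
    """Minimize 0.5 * ||y - beta||^2 + lam * sum_g ||beta_g|| group by group."""
    beta = []
    for y_g in y_groups:
        norm = math.sqrt(sum(v * v for v in y_g))
        # Shrink the whole group towards zero; small groups are set to zero.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0.0 else 0.0
        beta.append([scale * v for v in y_g])
    return beta


def gradient_norms(y_groups, beta):
    """Per-group norm of the gradient of S(beta) = 0.5 * ||y - beta||^2."""
    return [
        math.sqrt(sum((b - v) ** 2 for v, b in zip(y_g, b_g)))
        for y_g, b_g in zip(y_groups, beta)
    ]


y_groups = [[3.0, 4.0], [0.3, 0.4]]
lam = 1.0
beta_hat = group_soft_threshold(y_groups, lam)
norms = gradient_norms(y_groups, beta_hat)
```

Here the first group stays active with gradient norm equal to `lam`, while the second group is zeroed out with a strictly smaller gradient norm, which is exactly the situation to which the uniqueness argument above applies.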

Proof of Lemma 1. The result easily follows from Lemma 8.4 in [29], 
which we cite here for completeness. 
Lemma 3. Let $\mathcal{G}$ be a collection of functions $g: \{x_1, \ldots, x_n\} \to \mathbb{R}$,
endowed with a metric induced by the norm $\|g\|_n = (\frac{1}{n} \sum_{i=1}^n g^2(x_i))^{1/2}$. Let
$H(\cdot)$ be the entropy of $\mathcal{G}$. Suppose that

\[ H(\delta) \le A \delta^{-2(1-\alpha)}, \qquad \forall \delta > 0. \]

Furthermore, let $\varepsilon_1, \ldots, \varepsilon_n$ be independent centered random variables,
satisfying

\[ \max_i E[\exp(\varepsilon_i^2 / L)] \le M. \]

Then for a constant $c_0$ depending on $\alpha$, $A$, $L$ and $M$, we have for all $T \ge c_0$,

\[ P\Bigl( \sup_{g \in \mathcal{G}} \frac{|2(\varepsilon, g)_n|}{\|g\|_n^{\alpha}} \ge \frac{T}{\sqrt{n}} \Bigr) \le c_0 \exp\Bigl( -\frac{T^2}{c_0^2} \Bigr). \]

Proof of Lemma 1. It is clear that $\{g_j / I(g_j)\} = \{g_j : I(g_j) = 1\}$. Hence,
by rewriting and then using Lemma 3,

|2(e,g 3 -) re | \2{e igj /I{g 3 )) n \ T 

SUp j. ,, ' . = SUp 7. 7— 7J. < — = 

with probability at least 1 — cq exp(— T 2 /cq). Thus, for Cq > 2cq sufficiently 
large 



\ 2 ( £ ,9j)n\ /logp 



71 



/ Cflogp\ / Cglogp 
< pc exp g — < c exp — g — 



\ Cq J \ ZCq 

In the same spirit, for some constant c\ depending on L and M, it holds 
for all T > ci, with probability at least 1 — c\ exp(— T 2 d/c\ ), 

l 2 (e>^')n| . ^ [d 
sup 11, 1/ — <T\ /-, 

where d is the dimension occurring in (6). This result is rather standard but 
also follows from the more general Corollary 8.3 in [29]. It yields that for 
C 2 > 2c 2 , depending on d, L and M, 



l 2 (e,^j)n| , n /logp 

maxsup — — — |f < C±\ 

3 hj \\hj \\n V n 

with probability at least 1 — c\ exp (-C 2 logp/ (2c 2 )). 
Finally, it is obvious that for all $C_2$ and a constant $c_2$ depending on $L$
and $M$,



'logp 



e>C 2 \/— ) <2exp(-C|logp/ C 2) 



Choosing c 2 > 2, the result now follows by taking C = max{Co, C±, C 2 } and 
c = c + C\ + c 2 . □ 

Proof of Theorem 1. We begin with three technical lemmas. 
Recall that (for $j = 1, \ldots, p$)

\[ \tau_n^2(f_j) = \|f_j\|_n^2 + \lambda^{2-\gamma} I^2(f_j). \]

Lemma 4. For $\lambda \ge \sqrt{2}\, \xi_n / \eta$, we have on $S_1 \cap S_2$,

maxsup \ (2-^/2 \n - ?? - 
PROOF. Note first that with A > y/2£, n /i], 

< ^ - ^H^-fe - 5*) + ^Whj - h*\\ n 



V2 WJ * JUn ^ * 3 ' V2 1 

A(2-7)/2 . X 



\(2-7)/2 \(2~7)/2 

< fl—j^-y/V-rPbj -9*) + \\ gj -9*\\l + V—^-Whj - h* 



since A < 1 . 
We have 



^A2-7/2( 9j .- 5 *) + || 9j .- 5 *||2 + HZ,. _ fc*|| n 



< J2{A2-7/2 (5 . _ + y. _ g * { \2 + 11^. _ ^|| 2} 



V2J\^7p( 9j -g*) + II/,- -/„*||2, 



where we used the orthogonality of <?, — and /ij — hj. The result now 
follows from the equality — g*) = /(/,- — /,*)• □ 
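It holds that $\hat c = \bar Y$ ($= \sum_{i=1}^n Y_i / n$) and $c^* = E[\bar Y]$. Thus, on $S$, $|\hat c - c^*| \le \xi_n$.
Moreover,

\[ \|\hat f - f^0_{\mathrm{add}}\|_n^2 = |\hat c - c^*|^2 + \|(\hat f - \hat c) - (f^0_{\mathrm{add}} - c^*)\|_n^2. \]

To simplify the exposition (i.e., avoiding a change of notation), we may
therefore assume $\hat c = c^*$ and add a $\xi_n^2$ to the final result.

Lemma 5. We have on $S$,

\[ \|\hat f - f^0_{\mathrm{add}}\|_n^2 + (1 - \eta) \lambda^{(2-\gamma)/2} \sum_{j=1}^p \tau_n(\hat f_j - f_j^*) + \lambda^{2-\gamma} \sum_{j=1}^p I^2(\hat f_j) \]

\[ \le 2 \lambda^{(2-\gamma)/2} \sum_{j \in A_*} \tau_n(\hat f_j - f_j^*) + \lambda^{2-\gamma} \sum_{j \in A_*} I^2(f_j^*) + \|f^* - f^0_{\mathrm{add}}\|_n^2 + \xi_n^2. \]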
Proof. Because / minimizes the penalized loss, we have 

i n V i n v 

-Y,(y - f{ Xl )f + e < - - /* (^)) 2 + E J (/;)- 

77 . . 1 77 . . 

1=1 J = l *=1 J = l 

This can be rewritten as 

11/ - /aVii^ + E j(fi) < 2( e , / - fin + E J (r ) + iir - f°^\\l 

3=1 3=1 

Thus, on S, by Lemma 4 

11/ - &d 2 n + E J (/i) < ^ (H/2 E^(/) - //) +E 

j=l j=l j=l 

* fO i|2 



or 



+ II/* ~~ /addlln 



ll/-/a°ddl| 2 + E A^Vn^ + A^E^l/i) 

^A^t^^-^+A^)/ 2 E K(/;)-r n (/,)) 

+a 2 ^ e ^(/;) + iir-/a°ddi. 
<(i+7 ? )A( 2 ^)/ 2 e r n {f3-r 3 )+^ )/2 E *■„(/,-#) 



+ A 2- 7 ^ / 2 (/;) + ||/*-/ a ° dd ||2. 

jeA. 
In other words,

ll/-/ a ddll^ + (l-r/)A^/ 2 £ r^/.O + A 2 -^/ 2 ^) 

jM. 3=1 

<(!+ ^ 2 rnUj-f^ + X 2 ^ E ^ 2 (/;) + lir-/a°ddllL 

j'e.4* 

so that 

11/ " /a°ddl| 2 + (1 " V) E A^^TnC/j - /;) + A 2 " 7 E / 2 (/,) 
3=1 .7 = 1 



< 2A^)/ 2 E ^n(/i " //) + A 2 " 7 E ^(/^ + HZ* - /addlln- 



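Corollary 3. On $S$, either

(12)  \[ \|\hat f - f^0_{\mathrm{add}}\|_n^2 + (1 - \eta) \lambda^{(2-\gamma)/2} \sum_{j=1}^p \tau_n(\hat f_j - f_j^*) + \lambda^{2-\gamma} \sum_{j=1}^p I^2(\hat f_j) \le 4 \lambda^{(2-\gamma)/2} \sum_{j \in A_*} \tau_n(\hat f_j - f_j^*) \]

or

(13)  \[ \|\hat f - f^0_{\mathrm{add}}\|_n^2 + (1 - \eta) \lambda^{(2-\gamma)/2} \sum_{j=1}^p \tau_n(\hat f_j - f_j^*) + \lambda^{2-\gamma} \sum_{j=1}^p I^2(\hat f_j) \le 2 \lambda^{2-\gamma} \sum_{j \in A_*} I^2(f_j^*) + 2 \|f^* - f^0_{\mathrm{add}}\|_n^2 + 2 \xi_n^2. \]

Observe that if (13) holds, we have nothing further to prove, as this is
already an oracle inequality. So we only have to work with (12). It implies
that

(14)  \[ \sum_{j=1}^p \tau_n(\hat f_j - f_j^*) \le \frac{4}{1 - \eta} \sum_{j \in A_*} \tau_n(\hat f_j - f_j^*), \]

in other words, we may apply the compatibility condition to $\hat f - f^*$.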

Lemma 6. Suppose the compatibility condition holds. Then (14) implies 

4A (2-7)/2 £ Tb( /. _ /;) < 24^^-1 + A 2-7 E (I 2 (f)) + ^ 2 (/;)) 

+ 11/ ~~ /addlln + II/* — /addlln 

(under the simplifying assumption c = c* =0). 
Proof. We have

4A (2-7)/2 £ Tn( /. _ /;) 

j'eA, 



The compatibility condition now gives 

4 A (2-7)/2£ Tn(/j _ /;) 



< 



E(/i-/;) 



i=i 



+ 2A2-7 ^ P(f)-f*). 
jeA* 



With the simplifying assumption $\hat c = c^* = 0$, we may use the shorthand
notation $\hat f = \sum_j \hat f_j$ and $f^* = \sum_j f_j^*$. Next, we apply the triangle inequality:



l\\f-f*\\l + 2X^ J2 i 2 (fi-f;) 

jeA* 

< 11/ _ /add I In + II/* ~~ /addlk 



+ 2A^7 £ 2A 2- 7 ^ J 2(/ * ) 



We now use 



■II/-/, 



< 



4A 2 "T|A 



addlln ^ A2 
ii.* 



+ 11/ — /add 1 1 ii 



and similarly with / replaced by /*. In the same spirit 
4A( 2 -^)/ 2 ^A^ 



< 



4>n,* \ 

8A 2 ^|A 



b 2 

y n,* 



/2A 2 "7 J2 P{fj) 
jeA, 

+A 2 -^/ 2 (/ i ; 

jeA* 



and similarly with / replaced by /*. □ 

Proof of Theorem 1. By Lemma 5, we have on $S$,

\[ \|\hat f - f^0_{\mathrm{add}}\|_n^2 + (1 - \eta) \lambda^{(2-\gamma)/2} \sum_{j=1}^p \tau_n(\hat f_j - f_j^*) + \lambda^{2-\gamma} \sum_{j=1}^p I^2(\hat f_j) \]

\[ \le 2 \lambda^{(2-\gamma)/2} \sum_{j \in A_*} \tau_n(\hat f_j - f_j^*) + \lambda^{2-\gamma} \sum_{j \in A_*} I^2(f_j^*) + \|f^* - f^0_{\mathrm{add}}\|_n^2 + \xi_n^2. \]

In view of Corollary 3, we can assume without loss of generality that (12)
holds. Lemma 6 tells us now that

11/ " /a°dd \\l + (1 " V)^~~ <)12 t Tntfi ~ //) + ^ E l2 (ti 

3=1 3=1 

< + £ + \\\f~ /a°ddlln + III/* - /addfn 

This can be rewritten as 

11/ " /£ttlln + 2(1 - V)^ )/2 t Tntfi ~ fj) + A2 " 7 t ^(A) 

\2— -y I y| I 



<24— J^ + 3||r-/ a ° dd ||2+3A 2 ^ E I\f 3 *) + K 

< ™>* jeA* 



A.l. Proof of Theorem 2. We first show that the || • ||-norm and the 
|| • || n -norm are in some sense compatible, and then prove the same for the 
norms r and T n . 

Lemma 7. Suppose the theoretical compatibility condition holds, and that 

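Then on $S_4$, for all $f$ satisfying

\[ \tau_{\mathrm{tot}}(f) \le C_\eta \tau_{\mathrm{in}}(f), \]

we have

\[ \|f\|^2 \le 2 \|f\|_n^2 + (1 + \phi_*^2) \sum_{j \in A_*} \lambda^{2-\gamma} I^2(f_j). \]

Proof.

\begin{align*}
\|f\|^2 &\le \|f\|_n^2 + C_0 \lambda^{1-\gamma} \tau_{\mathrm{tot}}^2(f) \\
&\le \|f\|_n^2 + C_0 C_\eta^2 \lambda^{1-\gamma} \tau_{\mathrm{in}}^2(f) \\
&\le \|f\|_n^2 + \frac{\phi_*^2}{2} \sum_{j \in A_*} \bigl( \|f_j\|^2 + \lambda^{2-\gamma} I^2(f_j) \bigr) \\
&\le \|f\|_n^2 + \frac{1}{2} \|f\|^2 + \frac{1 + \phi_*^2}{2} \sum_{j \in A_*} \lambda^{2-\gamma} I^2(f_j).
\end{align*}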

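Lemma 8. On the set $S_4$, and for $\lambda^{1-\gamma} C_0 < 1$, it holds that

\[ (1 - \lambda^{1-\gamma} C_0) \tau(f_j) \le \tau_n(f_j) \le (1 + \lambda^{1-\gamma} C_0) \tau(f_j) \]

for all $j$.

Proof.

\[ |\tau_n^2(f_j) - \tau^2(f_j)| = | \|f_j\|_n^2 - \|f_j\|^2 | \le \lambda^{1-\gamma} C_0 \tau^2(f_j). \]

We use the short-hand notation

\[ \tau_{n,\mathrm{in}}(f) = \sum_{j \in A_*} \tau_n(f_j), \qquad \tau_{n,\mathrm{out}}(f) = \sum_{j \notin A_*} \tau_n(f_j) \]

and

\[ \tau_{n,\mathrm{tot}}(f) = \tau_{n,\mathrm{in}}(f) + \tau_{n,\mathrm{out}}(f). \]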

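Proof of Theorem 2. If

\[ \tau_{n,\mathrm{tot}}(f) \le \frac{4}{1 - \eta} \tau_{n,\mathrm{in}}(f), \]

then by Lemma 8, on $S_4$,

\[ \tau_{\mathrm{tot}}(f) \le \frac{4(1 + \eta)}{(1 - \eta)^2} \tau_{\mathrm{in}}(f). \]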

Moreover, on 1S4, for all j 
Hence, 

E H/illn< E Vit + wlU) 

j£A* jeA* 

= (1+17) E ii/#+^ 2 ~ 7 E /2 (/,)- 
Applying the theoretical compatibility condition, we arrive at

E \\m 2 n< il P 1 (\\f\\ 2 + ^ E / 2 (/i)W 2 --? £ i\f 3 ) 



jeA, ^* x jeA, 7 jeA* 

Next, apply Lemma 7 to obtain 

Ell/ill^^^ll/Hn 
jeA* v * 

+ ((i + ,)(i + <&) + fi±3> + a 2 " 7 £ j 2 (/,; 



2 t + A 2 - 7 EA/,)). 



□ 



APPENDIX B: THE SET $S_4$

In this subsection, we show that the set $S_4$ has large probability,
under reasonable conditions (mainly Condition D below). We assume again
throughout that the functions $f_j$ are centered with respect to the
theoretical measure $Q$. (Our estimator of course uses the empirical centering. It is
not difficult to see that this difference can be taken care of by adding a term
of order in the oracle result.)

Let $\mu$ be Lebesgue measure on $[0, 1]$, and let for $f_j : [0, 1] \to \mathbb{R}$,

\[ I^2(f_j) = \int |f_j^{(s)}|^2 \, d\mu = \|f_j^{(s)}\|_\mu^2, \]

where $\|\cdot\|_\mu$ denotes the $L_2(\mu)$-norm. Moreover, write $\mathcal{F}_j = \{f_j : I(f_j) < \infty\}$.
We let

\[ \alpha = 1 - \frac{1}{2s} \]

and

\[ \gamma = \frac{2(1 - \alpha)}{2 - \alpha} \]

as before.
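The two exponents are linked by simple algebra, which the following sketch verifies in exact rational arithmetic: with $\alpha = 1 - 1/(2s)$, the definition $\gamma = 2(1-\alpha)/(2-\alpha)$ agrees with $\gamma = 2/(2s+1)$ from Section 5.1, and $(2-\gamma)(1-\alpha) = \gamma$, the identity that turns $\|f_j\|^{\alpha} I^{1-\alpha}(f_j)$ into $\lambda^{-\gamma/2}\tau(f_j)$ later in this appendix. The function names are illustrative only.

```python
from fractions import Fraction


def alpha_of(s):
    """alpha = 1 - 1/(2s) for Sobolev smoothness s."""
    return 1 - Fraction(1, 2 * s)


def gamma_of(s):
    """gamma = 2(1 - alpha)/(2 - alpha)."""
    a = alpha_of(s)
    return 2 * (1 - a) / (2 - a)


# For s = 2 (cubic smoothing splines): alpha = 3/4 and gamma = 2/5.
table = {s: (alpha_of(s), gamma_of(s)) for s in range(1, 6)}
```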
We will use symmetrization arguments, and therefore introduce a Rademacher
sequence $\{\sigma_i\}$, independent of $\{X_i\}$.

The argumentation we shall employ can be summarized as follows. By
a contraction argument, we make the transition from the $f^2$'s to the $f$'s.
This step needs boundedness of weighted $f$'s, because the function $x \mapsto x^2$
is only Lipschitz on a bounded interval. The fact that we use the Sobolev
norm as measure of complexity makes this work. The contraction inequality
is in terms of the expectation of the weighted empirical process. We use a
concentration inequality to get a hold on the probabilities.

The original $f$'s are handled by looking at the maximum over $j$ of the
weighted empirical process indexed by $\mathcal{F}_j$. This is done by first bounding the
expectation, then applying a concentration inequality to get exponentially
small probabilities. This allows us to get similar probability inequalities
uniformly in $j \in \{1, \ldots, p\}$, inserting a $\log p$-term. We then rephrase the
probabilities back to expectations, now uniformly in $j$.

To establish a bound for the expectation of the weighted empirical process
indexed by $\mathcal{F}_j$ with $j$ fixed, we first prove a conditional bound involving the
empirical norm, then a contraction inequality to reduce the problem of this
empirical norm, involving the $f_j^2$'s, to the problem involving the original
$f_j$'s. We then unravel the knot.

We now will present this program, but in reverse order.

B.1. Weighted empirical process for fixed j. We fix an arbitrary $j \in
\{1, \ldots, p\}$, and consider the weighted empirical process

\[ \frac{|(1/n) \sum_{i=1}^n \sigma_i f_j(X_i^{(j)})|}{\lambda^{(2-\gamma)/2} \tau(f_j)}. \]

Our aim is to prove Corollary 5.

The following lemma is well known in the approximation literature. We
refer to [29] and the references therein. For a class of functions $\mathcal{G}$, we denote
the entropy of $\mathcal{G}$ endowed with the metric induced by the sup-norm, by
$H_\infty(\cdot, \mathcal{G})$.

Lemma 9. For some constant $A_s$, we have

\[ H_\infty(\delta, \{I(f_j) \le 1, |f_j|_\infty \le 1\}) \le A_s \delta^{-2(1-\alpha)}, \qquad \delta > 0. \]

Let for all $R > 0$,

\[ \mathcal{F}_j(R) = \{I(f_j) \le 1, |f_j|_\infty \le 1, \|f_j\| \le R\}. \]

The next theorem is along the lines of, for example, [31], Corollary 2.2.5.
It applies the entropy bound of Lemma 9. We have put in a rough but
explicit constant. We write $E_X$ for the conditional expectation given $X =
(X_1, \ldots, X_n)$.
Theorem 3. We have



E-5 



sup 



i=i 



where 



R n = sup ||/j||„. 



To turn the bound of Theorem 3 into a bound for the unconditional 
expectation, we need to handle the random R n . For this purpose, we reuse 
Theorem 3 itself. 



Theorem 4. We have 



A n2q(2- 7 )/2 



Proof. By symmetrization and the contraction inequality of [18], 



E 



SUP Hl/illn- ||/il 



< 8E sup 
\ n 



i=l 



where we used Theorem 3. It also follows that 

E[i^]-i? 2 <2 7 ^EK] 



Since by Jensen's inequality 

E[^ ]=E[( ^)2/a ] > (E[j Ra ]) 2/c 

we may conclude that 

(E[R"]) 2/a <R 2 + 2 7 ^E[R«}. 



Now, for any positive a and b, 

ab<a 2 '^ + b 2 ' a , 

hence, also 



Apply this with 



a5 < 2 a/(2-a) a 2/(2- a)+ l 6 2/^ 



= 2 7 A b = E[R%) 

in 
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to find 

(nK]f ,a <r 2 + 2 a i^- a) ( 2 7 4^ 2/(2 a) + -mfc\) 2,a . 

\ V n J 2 

It follows that 

/ A \2/(2-Q) 

(E[^]) 2 /°<2 J R 2 + f2 8 -=' 



and hence 



A \2a/(2-a) 



A x2a(2- 7 )/2 



□ 



Corollary 4. We have 



E 



sup 



i=l 



< 



2 4 A 



n 



(2i? 2 ) Q + (2 8 ^=j 



4 x2a(2- 7 )/2 



for some constant A s depending only on a = a(s) and A s . 

The peeling device is inserted to establish a bound for the weighted em- 
pirical process. 



Lemma 10. Define 

Then for A > 5 n , 

E sup 



5 n = {A s /^). 



i(/i)<i,l/il<»<i A( 2 -7)/VII/jll 2 + A 2 ~7 



where 
E



sup 

/(/i)<l,l/ 3 -|oo<l A(2-7)/2^||/ y ||2 + A 2- 7 



< E 



sup 

/je^f(A( 2 -T)/ 2 ) 



oo 



|l/nE?=i^/j(^. 

A2-7 



sup 



|l/nEr=i^(^ 



(3^ 



3 =\ J= l n 



< ( 2 +2zx:^ i(i " a) )T = f 2+2 - 



3=1 



A 



-a/(l-a) \ § 



1 — a / A 



□ 



We now show how to get rid of the restriction $|f_j|_\infty \le 1$ in Lemma 10.

Lemma 11. Define

\[ \delta_n = A_s / \sqrt{n}. \]

Then for $\delta_n \le \lambda \le 1$,



E 



ll/nELi^PQ 



sup 

/(/,•)<! A( 2 -7)/2./||/,||2 + A2-7 



A ' 



where 



Proof. We can write $f_j = g_j + h_j$, where $h_j$ is a polynomial of degree
$s - 1$ and $|g_j|_\infty \le I(g_j) = I(f_j)$. We take $g_j$ and $h_j$ to be orthogonal:

\[ \int g_j h_j \, dQ_j = 0. \]



Then 



ll/nELx^/i^)! ^ |l/nEtl^(^)l , 1 1/n E?=i 



^(3)1 



AM/VH/3II 2 + A2 " 7 A(2 ~ 7)/2 vll*'H 2 + A2 ~ 7 

We moreover can write 

s-l 



A( 2 -T)/2||/ lj . 
where the $\{p_k\}$ are orthogonal polynomials, and have norm $\|p_k\| = 1$. Hence,
using that $\sum_{k=1}^{s-1} \theta_k^2 = \|h_j\|^2$,



< 



s-l 
\ k=l 



t=i 



This gives 



E 



sup 

hi 



A (2 — ^)/2|| /i 



0^ 



since 



\/n<?>n = A S >1. 

Using the renormalization 

fr - J) IU)) 

we arrive at the required result: 

Corollary 5. We have 

\l/n^Ua i f 1 {X^] 



□ 



E 



sup ■ 



h A(2-7)/2^||/ i ||2 + A 2- 7j 2 (/ . ) 



<C a 



A 



B.2. From expectation to probability and back. Let $\mathcal{G}$ be some class of
functions on $\mathcal{X}$, $\zeta_1, \ldots, \zeta_n$ be independent random variables with values in
$\mathcal{X}$, and

\[ Z = \sup_{g \in \mathcal{G}} \Bigl| \frac{1}{n} \sum_{i=1}^n \bigl( g(\zeta_i) - E[g(\zeta_i)] \bigr) \Bigr|. \]

Concentration inequalities are exponential probability inequalities for the
amount of concentration of $Z$ around its mean. We present here a very tight
concentration inequality, which was established by [4].

Theorem 5 (Bousquet's concentration theorem [4]). Suppose

\[ \frac{1}{n} \sum_{i=1}^n E\bigl[ (g(\zeta_i) - E[g(\zeta_i)])^2 \bigr] \le R^2 \qquad \forall g \in \mathcal{G}, \]

and moreover, for some positive constant $K$,

\[ |g(\zeta_i) - E[g(\zeta_i)]| \le K \qquad \forall g \in \mathcal{G}. \]

We have for all $t > 0$,

\[ P\Bigl( Z \ge E[Z] + \frac{tK}{3n} + \sqrt{\frac{2t (R^2 + 2K E[Z])}{n}} \Bigr) \le \exp(-t). \]

Corollary 6. Under the conditions of Theorem 5,

(15)  \[ P\Bigl( Z \ge 4 E[Z] + \frac{2tK}{3n} + R \sqrt{\frac{2t}{n}} \Bigr) \le \exp(-t). \]
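The passage from Theorem 5 to (15) is not spelled out in the text. A short derivation, assuming Theorem 5 is read in the usual normalized form of Bousquet's inequality, $P(Z \ge E[Z] + tK/(3n) + \sqrt{2t(R^2 + 2K E[Z])/n}) \le e^{-t}$, uses only $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$ and $2\sqrt{ab} \le a/3 + 3b$:

```latex
\begin{align*}
\sqrt{\frac{2t\,(R^2 + 2K\,E[Z])}{n}}
  &\le R\sqrt{\frac{2t}{n}} + 2\sqrt{\frac{tK}{n}\,E[Z]} \\
  &\le R\sqrt{\frac{2t}{n}} + \frac{tK}{3n} + 3\,E[Z],
\end{align*}
% so that
\[
E[Z] + \frac{tK}{3n} + \sqrt{\frac{2t\,(R^2 + 2K\,E[Z])}{n}}
  \le 4\,E[Z] + \frac{2tK}{3n} + R\sqrt{\frac{2t}{n}} .
\]
```

The event in (15) is therefore contained in the event of Theorem 5, which gives the stated probability bound.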



Conversely, given an exponential probability inequality, one can of course
prove an inequality for the expectation.

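Lemma 12. Let $Z \ge 0$ be a random variable, satisfying for some
constants $C_1$, $L$ and $M$,

\[ P\Bigl( Z \ge C_1 + \frac{Lt}{n} + M \sqrt{\frac{2t}{n}} \Bigr) \le \exp(-t) \qquad \forall t > 0. \]

Then

\[ E[Z] \le C_1 + \frac{L}{n} + M \sqrt{\frac{\pi}{2n}}. \]

Proof.

\[ E[Z] = \int_0^\infty P(Z \ge a) \, da \le C_1 + \int_0^\infty P(Z \ge C_1 + a) \, da. \]

Now, use the change of variables

\[ a = \frac{Lt}{n} + M \sqrt{\frac{2t}{n}}. \]

Then

\[ da = \Bigl( \frac{L}{n} + \frac{M}{\sqrt{2nt}} \Bigr) dt. \]

So

\[ E[Z] \le C_1 + \frac{L}{n} \int_0^\infty e^{-t} \, dt + \frac{M}{\sqrt{2n}} \int_0^\infty \frac{e^{-t}}{\sqrt{t}} \, dt = C_1 + \frac{L}{n} + M \sqrt{\frac{\pi}{2n}}. \qquad \Box \]

Lemma 13. Let, for $j = 1, \ldots, p$, $\mathcal{G}_j$ be a class of functions and let

\[ Z_j = \sup_{g_j \in \mathcal{G}_j} \Bigl| \frac{1}{n} \sum_{i=1}^n \bigl( g_j(\zeta_i) - E[g_j(\zeta_i)] \bigr) \Bigr|. \]

Suppose that for all $j$ and all $g_j \in \mathcal{G}_j$,

\[ \|g_j\| \le R, \qquad |g_j|_\infty \le K. \]

Then

\[ E\Bigl[ \max_{1 \le j \le p} Z_j \Bigr] \le 4 \max_{1 \le j \le p} E[Z_j] + \frac{2K(1 + \log p)}{3n} + R \sqrt{\frac{4(1 + \log p)}{n}}. \]

Proof. By the corollary of Bousquet's inequality, we have for each $j$

\[ P\Bigl( Z_j \ge 4 E[Z_j] + \frac{2tK}{3n} + R \sqrt{\frac{2t}{n}} \Bigr) \le \exp(-t) \qquad \forall t > 0. \]

Replacing $t$ by $t + \log p$, one finds that

\[ P\Bigl( \max_j Z_j \ge 4 \max_j E[Z_j] + \frac{2tK}{3n} + \frac{2K \log p}{3n} + R \sqrt{\frac{2t}{n}} + R \sqrt{\frac{2 \log p}{n}} \Bigr) \le p \exp[-(t + \log p)] = \exp(-t). \]

Apply Lemma 12, with the bound $\pi/4 \le 1$, and with

\[ C_1 = 4 \max_j E[Z_j] + \frac{2K \log p}{3n} + R \sqrt{\frac{2 \log p}{n}}, \qquad L = \frac{2K}{3}, \qquad M = R. \qquad \Box \]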
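The arithmetic in the last step can be checked numerically. The sketch below assumes the reading of Lemma 12 as $E[Z] \le C_1 + L/n + M\sqrt{\pi/(2n)}$ and of Lemma 13 as stated with $1 + \log p$ terms; it evaluates both the bound obtained by plugging $C_1$, $L = 2K/3$, $M = R$ into Lemma 12 and the simplified bound of Lemma 13, and confirms that the former never exceeds the latter. The function names are illustrative.

```python
import math


def lemma12_bound(c1, L, M, n):
    """Right-hand side C1 + L/n + M * sqrt(pi / (2n)) of Lemma 12."""
    return c1 + L / n + M * math.sqrt(math.pi / (2.0 * n))


def lemma13_bound(max_mean, K, R, n, p):
    """4 max_j E[Z_j] + 2K(1 + log p)/(3n) + R sqrt(4(1 + log p)/n)."""
    log_term = 1.0 + math.log(p)
    return (4.0 * max_mean + 2.0 * K * log_term / (3.0 * n)
            + R * math.sqrt(4.0 * log_term / n))


def lemma13_via_lemma12(max_mean, K, R, n, p):
    """Lemma 12 applied with the constants chosen in the proof of Lemma 13."""
    log_p = math.log(p)
    c1 = (4.0 * max_mean + 2.0 * K * log_p / (3.0 * n)
          + R * math.sqrt(2.0 * log_p / n))
    return lemma12_bound(c1, 2.0 * K / 3.0, R, n)


cases = [(0.1, 1.0, 0.5, 100, 250), (0.0, 3.0, 2.0, 50, 1), (1.0, 0.2, 4.0, 1000, 1000)]
comparisons = [lemma13_via_lemma12(*c) <= lemma13_bound(*c) for c in cases]
```

The comparison rests on $\sqrt{\pi/2} \le \sqrt{2}$ and $\sqrt{u} + \sqrt{v} \le \sqrt{2(u + v)}$, which is where the factor $4(1 + \log p)$ under the square root comes from.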

B.3. The supremum norm. The following lemma can be found in [29].
It is a corollary of the interpolation inequality of [1].

Lemma 14. There exists a constant $c_s$ such that for all $f_j$ with $I(f_j) \le 1$,
one has

\[ |f_j|_\infty \le c_s \|f_j\|_\mu^{\alpha}. \]

Condition D. For all $j$, $dQ_j / d\mu = q_j$ exists and

\[ q_j \ge \eta_q^2 > 0. \]

Corollary 7. Assume Condition D. Then for all $j$ and all $f_j$ with
$I(f_j) \le 1$, we have

\[ |f_j|_\infty \le c_{s,q} \|f_j\|^{\alpha}, \]

where $c_{s,q} = c_s / \eta_q$. This implies that for all $j$ and $f_j$,

\[ |f_j|_\infty \le c_{s,q} \|f_j\|^{\alpha} I^{1-\alpha}(f_j). \]
Lemma 15. Assume Condition D and that $\lambda \ge \sqrt{4(1 + \log p)/n}$, and
$\delta_n \le \lambda \le 1$. We have

ii/™e?=i^/;(*P ; 



E 



max sup ■ 

j Si 



A(2-7)/2 r(/i ) 



<4C\^ + c s , g A + A^ 2 . 



Proof. By Corollary 5, we have for each j 

ll/nEEWiC*^) 11 



E 



sup ■ 



A(2-7)/2 r(/i ) 

Moreover, in view of Corollary 7, 



<c.£. 



J |oo 



< 



-s,q 



We also have 



A(2-7)/2 r(/i ) " A 
ll/ill 



< 



1 



A(2-7)/2 r (/ J .) " A( 2 -^)/ 2 ' 

Now, apply Lemma 13 with 

Co.rt ^ 1 



R 



A(2"7)/2 ' 



to find 



E 



max sup 

j Sj 



A(2-7)/2 T (/.) 



~ <5 n ^(l + logp) 1 / 4(l+logp) 



□ 



B.5. Expectation of the weighted empirical process, indexed by the
additive $f$'s.

Lemma 16. Assume Condition D and that $\lambda \ge \sqrt{4(1 + \log p)/n}$, and
$\delta_n \le \lambda \le 1$. Then

|l/n£?=iffi/(*i) 



E 



T AM/2 Ttot(/) 
Proof. It holds that



3=1 



1 n 

i=l 



Hence, 



IE 



sup 



|l/nE?=i*i/«l 



7 A(2-T)/2 rtot(/ ) 

<E 



8UP jrWnZL^fM': 



< E 



:E 



[„f h no,(/) £ A(2-.)/2 r(/j ) 



sup 



■ max sup ■ 



L/= E /,rtot(/) i / A(2-7)/2 T(/i ) £ 
|l/n££Wi(*, W ? n 



max sup ■ 

J / A(2-7)/2 T (/.) 



<G^ + Csi(7 A + A^ 2 



□ 



B.6. Expectation of the weighted empirical process, indexed by the
additive $f^2$'s.



Lemma 17. Under Condition D, 

1 32 1 



E 



sup 



/ 7tot(/) 



< 8c, „A" 7/2 E 



sup ■ 

/ 



IV«E?=i^/PQ)l 



Ttot(/) 



Proof. By a symmetrization argument (see, e.g., [31]), 



E 



sup 



< 2E 



sup ■ 

/ 



ll/nELi^/Wl 



T r tot(/) 

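Because for all $j$,

\[ \|f_j\|^{\alpha} I^{1-\alpha}(f_j) \le \lambda^{-\gamma/2} \tau(f_j), \]

we know from Corollary 7 that

\[ |f_j|_\infty \le c_{s,q} \lambda^{-\gamma/2} \tau(f_j). \]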






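Hence,

\[ |f|_\infty \le \sum_{j=1}^p |f_j|_\infty \le c_{s,q} \lambda^{-\gamma/2} \sum_{j=1}^p \tau(f_j) = c_{s,q} \lambda^{-\gamma/2} \tau_{\mathrm{tot}}(f). \]

Let $K = c_{s,q} \lambda^{-\gamma/2}$. Now, the function $x \mapsto x^2$ is Lipschitz on $[-K, K]$, with
Lipschitz constant $2K$. Therefore, by the contraction inequality of Ledoux
and Talagrand [18], we have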



E 



sup 

/ 



<4ifE 



sup 

/ 



|l/n£?=iffi/(*i) 



Ttot(/) 



□ 



Corollary 8. Using Lemma 16, we find under Condition D, and for
$\delta_n \le \lambda \le 1$, $\lambda \ge \sqrt{4(1 + \log p)/n}$,



E 



SUp- 2 



1 <8c s ,,A 1 " 7 fc' s ^ + 



B.7. Probability inequality for the weighted empirical process, indexed
by the additive $f^2$'s. We are now finally in the position to show that $S_4$
has large probability.



Theorem 6. Let 



Z = sup ■ 
/ 



r 2 (f) 



Assume Condition D, and $\delta_n \le \lambda \le 1$, $\lambda \ge \sqrt{4(1 + \log p)/n}$. Then



Z > c^A 1 " 7 [ 2 7 C S 4- + 32A + 32A 7 / 2 + V2l 1 + — ^ 



A 



4cLA 2 ( 1 " 7 )t 



<exp(-nA 2 - 7 t). 
Proof. We have 



I/ 2 



r 2 (f) 



< <^ 



and 



and 



r 2 (/)- C ^ A r(/) 



<£il/ili <*-(/). 

i=i 
So we can apply the corollary of Bousquet's inequality with

K = cl q \^ 

and 

R = c s , q \-^ 2 . 

We get that for all $t > 0$

Use the change of variable $t \mapsto n \lambda^{2-\gamma} t$ to reformulate this as: for all $t > 0$,

/ 4c 2 X 2 ^-^t \ 
F(Z> 4E[Z] + s ' q + c s>q \ l ~^V2t\ < exp(-nA 2 ~ 7 t). 

Now, insert 

E[Z] < Scs^X 1 ^ [aC s 5 -^- + c ss \ + X'l 2 

Remark 4. Recall that $\delta_n = A_s / \sqrt{n}$. Thus, taking $1 \ge \lambda \ge A_s / \sqrt{n}$ and
$\lambda \ge \sqrt{4(1 + \log p)/n}$, we see that for some constant $C_{s,q}$ depending only on
$s$ and the lower bound for the marginal densities $\{q_j\}$, and for

\[ C_0 = C_{s,q} (1 + \sqrt{2t} + \lambda^{1-\gamma} t), \]

we have

\[ P(S_4) \ge 1 - \exp(-n \lambda^{2-\gamma} t). \]